composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-19 02:01:01 +00:00

Author	SHA1	Message	Date
Bartłomiej Kocot	ce4525e82b	[CK][CK Tile] Fix kbatch check in grouped conv and gemm kernels (#5555 ) ## Motivation Fix kbatch check in grouped conv and gemm kernels, allow tails for kbatch. ## Technical Details Round up K / Kperxdl and divide it by Kbatch to allow tail for K. ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-21 23:55:24 +01:00
Emily Martins	59e6fc67d8	[CK_TILE] Prune Stream-K Tile Engine Tests (#5625 ) ## Motivation Stream-K tile engine tests are causing issues for build time. While we work on a more permanent solution, these changes prune the Stream-K test instances to help reduce the build time burden. ## Technical Details The Stream-K team recently transitioned to using CK Tile's tile engine infrastructure for our smoke tests. However, since tile engine creates an individual target per kernel instance, we've found that the tile engine tests are increasing build times. Our team is currently working to convert our existing tile engine tests back to basic gtests. While this work takes place, we are temporarily pruning the existing Stream-K tile engine test instances to help reduce the build time burden. ## Test Plan Ran the pruned test set on all gfx90a, gfx942, and gfx950. ## Test Result All tests pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 13:30:34 -07:00
andrew clark	ba959369e3	Improved CI infrastructure failure detection (#5464 ) ## Motivation This PR re-enables CI infrastructure failure detection and notification, which had been disabled due to performance issues caused by loading large build logs (~80k lines) into memory for pattern scanning. The goal is to reliably detect known infrastructure failures (GPU errors, Docker authentication issues, disk space errors, etc.) and send actionable Teams notifications without hanging on large logs. ## Technical Details - Replaced full build log loading and Groovy-based pattern scanning with a streaming wget \| grep -E pipe. grep scans natively so the full log is never loaded into Groovy, resolving the hang on large logs. - Combined all failure patterns into a single grep -E call to avoid multiple log fetches. - The node name is now tracked with the observed failure. - Added a new failure pattern for device's running out of space. ## Test Plan - Forced failures in the "Determine CI Execution" stage with all 9 failure patterns echoed to the build log. - Simulated large log sizes (~80k lines of dummy output) to validate pattern detection and node name extraction at realistic log scales, including patterns placed both before and after large blocks of dummy output. ## Test Result All 9 failure patterns detected correctly. Teams notifications sent with accurate log context, node name, and job links. No hangs observed on 80k line simulated logs. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 19:16:58 +00:00
Jobbins	8892f83890	add self healing to ref repo (#5630 ) ## Motivation Check for when mirror repo gets corrupted in CI ## Technical Details We detect broken ref objects and rebuild the local mirror in that case of corruption ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 09:42:53 -07:00
Bartłomiej Kocot	5db7b220d5	[CK][CK Tile] Improve access for merged groups and remove modulo from xor (#5334 ) ## Motivation [CK][CK Tile] Improve access for merged groups and remove modulo from xor ## Technical Details - add template parameter to xor if modulo is needed. We don't need modulo for merged groups - use access by m for merged groups for a tensor - ## Test Plan test_grouped_convnd_fwd_tile ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 15:45:45 +00:00
Kiefer van Teutem	e8f9bb0c19	[CK_Tile] Refactor amdgcn_mma policy structs (#5272 ) ## Motivation The point of this MR is to update the intrinsic layout parameters to simplify them and make them more clear and flexible. Also, a number of simple refactors were performed to reduce boilerplate and code duplication. ## Technical Details In CK Tile and old CK, the full set of information available in the intrinsic wrappers, for WMMA and MFMA combined, would be something like: ``` // Basic info using ADataType = void; using BDataType = void; using CDataType = void; using AVecType = ext_vector_t<ADataType, 0>; using BVecType = ext_vector_t<BDataType, 0>; using CVecType = ext_vector_t<CDataType, 0>; // Fragment sizes static constexpr index_t kM; static constexpr index_t kN; static constexpr index_t kK; // Layout parameters static constexpr index_t kAMBlock; static constexpr index_t kBNBlock; static constexpr index_t kRepeat; static constexpr index_t kAMLane; static constexpr index_t kBNLane; static constexpr index_t kABK0PerLane; static constexpr index_t kABKLane; static constexpr index_t kABK1PerLane; static constexpr index_t kCMLane; static constexpr index_t kCNLane; static constexpr index_t kCM0PerLane; static constexpr index_t kCM1PerLane; using kABPs2RHssMajor = sequence<2, 1>; using kABPs2RHssMinor = sequence<1, 0>; using kABYs2RHsMajor = sequence<2, 2>; using kABYs2RHsMinor = sequence<0, 2>; using kCPs2RHssMajor = sequence<1, 2>; using kCPs2RHssMinor = sequence<1, 0>; using kCYs2RHsMajor = sequence<1, 1>; using kCYs2RHsMinor = sequence<0, 2>; using kCTPs2RHssMajor = sequence<2, 1>; using kCTPs2RHssMinor = sequence<1, 0>; using kCTYs2RHsMajor = sequence<2, 2>; using kCTYs2RHsMinor = sequence<0, 2>; ``` Note that on top of the intrinsic sizes, we have 12 layout parameters. I have reduced this in the new design to: ``` // Basic info using ADataType = void; using BDataType = void; using CDataType = void; // Fragment sizes static constexpr index_t kM; static constexpr index_t kN; static constexpr index_t kK; // Layout parameters static constexpr index_t kABKPerLane; // K2 * K0, Always the same, even for diff A / B layouts static constexpr index_t kAKNumAccess; // K2 static constexpr index_t kARepeat; // Used for RDNA3 repeated inputs and CDNA block hiding. static constexpr index_t kBKNumAccess; // K2 static constexpr index_t kBRepeat; // Used for RDNA3 repeated inputs and CDNA block hiding. static constexpr index_t kCMPerLane; // M2 * M0 static constexpr index_t kCMNumAccess; // M2 // Derived properties using AVecType = ext_vector_t<ADataType, 0>; using BVecType = ext_vector_t<BDataType, 0>; using CVecType = ext_vector_t<CDataType, 0>; ``` Note that there are now only 7 layout parameters and no more dimensionality orderings. Believe it or not these 7 parameters are more general than the original 12, and can handle intrinsic and mid-level features that are currently awkward in CK Tile, like dealing with AttrNumAccess, different A / B layouts, more general block-hiding (currently very limited in CK tile), and future arch features. Furthermore, the A, B and C vec types are now derived directly from the layout parameters to ensure internal consistency. I added a detailed explanation of the new params in terms of register mappings at the top of amgcn_mma.hpp Other refactorings I did in this MR: - Make an amdgcn_mma_base struct to drastically reduce code duplication and potential bugs. Should also make auto-generating the amd_gcn specializations much easier. - Simplify the MmaOpTraits significantly by only including those parameters that are not directly gettable from the MmaOp itself. This removes duplicated variables and simplifies higher level code. - Remove overloaded "Block" term for intrinsic dimensions, and replace by "Frag" instead. Some spots were already using the term "Frag" for combined intrinsics, in which case I changed that term to "Chunk" instead. - Remove some tests that had become somewhat pointless (setting variables and then checking their values immediately). - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 09:07:00 -06:00
Bartłomiej Kocot	0a3229ea22	[CK][CK Tile] Move grouped conv cpp instances to build dir (#5609 ) ## Motivation Move grouped conv .cpp instances to build dir. Fix generate instances script. ## Technical Details Avoid CI problem when instances in experimental directory are not removed ## Test Plan test_grouped_convnd_*_tile ## Test Result Pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 13:19:31 +00:00
joyeamd	d2b129f56b	Ck/joye/revert oob check (#5640 ) ## Motivation fix ck_tile's oob check. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 12:30:08 +00:00
arai713	121fa1a503	[CK_TILE] Rename Stream-K grid function (#4795 ) ## Motivation This PR introduces a change in the name of the get_grid function in the Stream-K TilePartitioner to avoid confusion with a similarly named method. In the Stream-K TilePartitioner, there is get_grid() which returns num_cu*occupancy and there is grid_size() which returns the grid size used to launch the kernel. In this PR, we change get_grid() to be get_max_active_wgs() to better reflect what the function returns and not confuse it with grid_size(). ## Technical Details Initially in the Stream-K TilePartitioner we had get_grid() which returned grid_. We are renaming get_grid() to get_max_active_wgs() and grid_ to max_active_wgs_ internally, while keeping grid_size() the same. The parameter, grid, for the Stream-K TilePartitioner remains the same to maintain consistency with the rest of the Stream-K API. ## Test Plan Validated using the test suite that is already present. ## Test Result All tests passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 03:27:44 -06:00
yinglu	b973cc43d3	[CK]fix: remove redundant structured sparsity check in run_gemm_example.inc (#5612 ) ## Motivation This issue if found via https://github.com/ROCm/rocm-libraries/pull/4302#discussion_r2958603418 and is introduced via https://github.com/ROCm/rocm-libraries/pull/5323. ## Technical Details The outer `if` and inner `if constexpr` both checked GemmConfig::UseStructuredSparsity. Merged into a single `if constexpr` since both preshuffle and UseStructuredSparsity are compile-time constants. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 09:22:14 +01:00
Sami Remes	a699df9fdc	[CK_TILE] Enable MXFP6 for MX GEMM op (#5095 ) ## Motivation Add support for MXFP6 in the MX GEMM op in CK-Tile. Depends on https://github.com/ROCm/rocm-libraries/pull/4594 ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-19 18:07:47 -07:00
Yaswanth Raparti	a61238b69f	[CK][CK TILE] Fix smart-build to run install target for client examples (#5614 ) How ninja install works: - Builds library dependencies (device_operations, etc.) - Installs them to CMAKE_INSTALL_PREFIX - Skips building test executables (not install dependencies) Affected stages (8): - gfx942/gfx950/gfx908/gfx90a CK Client Examples - gfx10-1/gfx10-3/gfx11/gfx12 CK Client Examples ## Motivation Problem: When smart-build is enabled (runAllUnitTests=false), the build step is skipped entirely. This causes client example stages to fail because they depend on the CK library being installed to ../install. Error seen: Target "client_gemm" links to: composable_kernel::device_other_operations but the target was not found. ## Technical Details Root cause: Line 712 only checked runAllUnitTests, so when building with config_targets="install", the install target was never built, leaving the install directory empty. Fix: Added condition to always build when config_targets contains 'install'. The install target automatically builds its dependencies (the CK libraries) but skips building tests, which aligns with smart-build philosophy. ## Test Plan Should be tested on CI ## Test Result Should be tested on CI ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-03-19 22:00:29 +00:00
Bartłomiej Kocot	aa5ae1c005	[CK][CK Tile] Fix dram step for KM/KN layouts in V1 pipeline (#5470 ) ## Motivation Fix v1 pipeline for KM/KN layouts by passing correct step for dram tile window. ## Technical Details - Fix dram step for KM/KN layouts in V1 pipeline - Disable instances which use more threads than warp size in continous dim (not supported in ck tile yet) - Use 1x1 specialization for explicit gemm - Use two stage for vectorsize =1 and sizeof(datatype) ==2 - remove not needed check sinze GetVectorSizeA/B check if vector size is fixed ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-966	2026-03-19 11:59:44 +00:00
assistant-librarian[bot]	39bc8453c6	[CK_TILE] add tf32 support (#4302 ) ## Proposed changes TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in CK_TILE on gfx942 and gfx950. ## Checklist Please put an into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run on all changed files - [ ] Any dependent changes have been merged ## Discussion --- 🔁 Imported from [ROCm/composable_kernel#3538](https://github.com/ROCm/composable_kernel/pull/3538) 🧑‍💻 Originally authored by @yingluAMD --------- Co-authored-by: yingluAMD <Yingmao.Lu@amd.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-03-19 10:17:20 +01:00
Yaswanth Raparti	1333922f04	[CK] [CK_TILE] Improve build and test time of CI with smart dependency parser (#5249 ) ## Motivation Existing dependency parser needs full build of tests to determine which tests are affected by code changes in a PR. This still takes 2-4 hours for building the tests which slows down the CI as the number of tests grow. To resolve this issue we implemented a smart dependency parser which uses CMake Configure to parse dependencies and build only the affected test cases. We have ensured that two approaches are available 1) CMake pre-build analysis for each PR to ensure fast build and test. 2) Ninja post-build analysis to enable full build for nightly tests. ## Technical Details ```bash ### 1. Configure the project with CMake cmake -G Ninja -DCMAKE_EXPORT_COMPILE_COMMANDS=ON .. ### 2. Analyze dependencies (no build required!) python3 ../script/dependency-parser/main.py cmake-parse compile_commands.json build.ninja \ --workspace-root .. --output cmake_dependency_mapping.json --parallel 8 ### 3. Find tests affected by changes python3 ../script/dependency-parser/main.py select cmake_dependency_mapping.json origin/develop \ HEAD --test-prefix --output tests_to_run.json ### 4. Build only affected tests ninja $(jq -r '.executables[]' tests_to_run.json \| tr '\n' ' ') ### 5. Run affected tests ctest -R "$(jq -r '.regex' tests_to_run.json)" ``` ### Jenkins Integration - Added `buildMode` to jenkinsfile to integrate both `selective` and `full` build methods ### Known Limitations ### 1. Build-Time Generated Headers (HIGH RISK) Problem: Files generated during the build process (e.g., via `add_custom_command`) cannot be analyzed before building. Example: ```cmake add_custom_command( OUTPUT ${CMAKE_BINARY_DIR}/generated/config.hpp COMMAND generate_config.sh DEPENDS template.hpp.in ) ``` Impact: If a source file includes `generated/config.hpp`, the dependency won't be detected until after building. Mitigation: - CK analysis shows no generated headers currently used - If generated headers are added in the future, they must be built first - Recommendation: Generate headers in CMake configure phase (not build phase) when possible ## Test Plan 1. Modified Files: ``` include/ck_tile/ops/common.hpp include/ck_tile/ops/gemm.hpp include/ck_tile/ops/gemm/warp/warp_gemm.hpp ``` 2. Compare tests selected between `build.ninja` and `cmake-parse` methods ## Test Result - 1. The test completed in 5-6 minutes finding about 8000+ executables that should be built. - 2. We selected a commit 5ccc1387ea which resulted in same 7 tests with both legacy and new methods. - PR \| Legacy tests \| Smart tests \| Notes -- \| -- \| -- \| -- 5261 \| 453 \| 455 \| Only 2 tests (test_amdgcn_mma and test_amdgcn_sparse_mma) 5168 \| 0 \| 0 \| Changes in dispatcher only. No CK tests invoked. 5249 \| 0 \| 0 \| Changes to dependency parser. No CK tests invoked 5260 \| 0 \| 0 \| Changes in dispatcher only. No CK tests invoked. 5174 \| 1 \| 1 \| One test from FMHA affected by this PR in both cases 5383 \| 0 \| 0 \| Changes are only in benchmark files. Did not trigger any tests 5445 \| 1 \| 1 \| Changes are only to tests/ck_tile/gemm_streamk. Only triggered one streamk test in both cases. 5454 \| 3 \| 3 \| Both methods identified same test_grouped_conv_bwd tests 5427 \| 234 \| 234 \| Core infrastructure header changes. Detected exactly same tests 5388 \| 85 \| 85 \| modifies warp-level GEMM operations (warp_gemm.hpp, warp_gemm_dispatcher.hpp). Correctly identified all the streamK gemm tests ## Submission Checklist - [x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-03-19 05:30:19 +00:00
lalala-sh	b0df41c187	[CK] Fix MOE FP8 SplitK buffer descriptor OOB (#5086 ) When SplitK is enabled, kernel entry shifts A/B/AScale/BScale base pointers by SplitKBatchOffset, but make_dynamic_buffer element spaces are still based on full K dimension. This causes hardware buffer resource descriptors to extend beyond the actual tensor allocation, leading to GPU memory access faults when the tensor happens to be placed at the end of an allocated memory pool region. Fix by subtracting the split offset from each buffer's element space in both Run() (v1 pipeline) and Run_2Lds() (v2/v3 pipeline), so the buffer descriptor range [shifted_base, shifted_base + reduced_space) exactly covers the valid allocation. Also refactor SplitKBatchOffset to accept const Problem& (instead of Argument&) and add a default constructor, enabling direct reuse in Run/Run_2Lds without duplicating offset calculation logic. Made-with: Cursor ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Yi DING <yi.ding@amd.com>	2026-03-19 10:41:53 +08:00
Christopher Millette	4fdcfab53d	[CK] Replace nested static_for with static_ford to reduce device IR function emissions [1B] (#5031 ) ## Summary ### Rationale CK's GPU kernels are among the slowest files in the ROCm build, with a single translation unit taking up to 10+ minutes. Profiling with `-ftime-trace` identified nested `static_for` loops as the root cause: each nesting level multiplies the number of unique lambda IR functions the compiler must process. A 2-level nest of `static_for<0, M, 1>` / `static_for<0, N, 1>` produces M×N unique lambda types. With typical GEMM dimensions (M=16, N=4), a single nest generates 64 unique functions — and these nests appear hundreds of times across the codebase. The LLVM backend's CGSCC (Call Graph Strongly Connected Components) framework processes each function independently, so reducing function count directly reduces backend time. ### What changed 393 nested compile-time loop patterns across 73 files are converted to `static_ford`, which flattens multi-dimensional compile-time iteration into a single `static_for` with index decomposition. This eliminates 994 `static_for` nesting levels (42% reduction). Three pattern categories were converted: - Category A: `static_for` wrapping `static_ford` — fold outer dimension into ford - Category B: nested `static_ford` — merge into single higher-dimensional ford - Category C: nested `static_for` chains — convert to single `static_ford` ### Verification ASM equivalence: PASS — 51/51 device assembly files identical (gfx942 + gfx1100) \| Architecture \| Files compared \| Largest file \| Result \| \|---\|---\|---\|---\| \| gfx942 \| 36 \| 386,685 lines \| ALL MATCH \| \| gfx1100 \| 15 \| 47,769 lines \| ALL MATCH \| Build time (Wilcoxon signed-rank test, 7 paired trials): \| Target \| Pre (s) \| Post (s) \| Delta \| p-value \| \|---\|---\|---\|---\|---\| \| bscale \| 169 \| 152 \| -9.8% \| 0.016 \* \| \| xdl_v1234 \| 207 \| 194 \| -6.6% \| 0.016 \* \| \| preshuffle \| 275 \| 264 \| -3.9% \| 0.016 \* \| \| xdl_base \| 142 \| 137 \| -3.2% \| 0.031 \* \| IR function counts (device backend, gfx942): \| Target \| InstFunc Δ \| CodeGen Δ \| Compiler Δ \| \|---\|---\|---\|---\| \| bscale \| -13,043 (-8.2%) \| -2,103 (-3.5%) \| -10.7% \| \| xdl_v1234 \| -9,431 (-5.7%) \| +59 (+0.1%) \| -5.2% \| \| xdl_base \| -6,162 (-4.9%) \| -1,141 (-2.5%) \| -2.2% \| \| xdl_old \| -3,234 (-3.7%) \| -963 (-8.7%) \| -3.3% \| ### Value - 994 fewer `static_for` nesting levels (-42%) across 73 files - 393 `static_ford` sites created (from 4 pre-existing) - Up to 9.8% compile-time reduction on representative targets (statistically significant, p < 0.05) - Up to 13K fewer IR function instantiations per translation unit - Net -849 LOC from reduced indentation - Zero ASM changes — identical device code output verified on gfx942 and gfx1100 - All scheduling barriers, `if constexpr` guards, and MFMA/WMMA accumulation order preserved ### Files changed (73) - `block/`: 47 files (GEMM pipelines — xdlops, wmma, moe, preshuffle, blockscale variants) - `grid/`: 20 files (softmax, normalization, reduction, attention, layernorm) - `thread/`: 5 files (tensor slice transfer, contraction, GEMM dlops, reduction) - `tensor_description/`: 1 file (tensor_adaptor) ## Test plan - [x] `static_ford` tested with 21 unit tests in `test/util/unit_ford.cpp` (1D-4D, custom orders, compile-time verification) - [x] All conversions preserve iteration order, `block_sync_lds()` placement, `if constexpr` scheduling guards, and MFMA/WMMA accumulation order - [x] ASM equivalence verified: 51 device `.s` files across gfx942 + gfx1100 - [x] Build-time improvement statistically confirmed (Wilcoxon, p < 0.05, 4 targets) - [x] IR function count reduction confirmed via `-ftime-trace` on 7 targets - [x] Detection script reports 0 remaining safe patterns (180 blocked with structural reasons) - [x] Existing CI tests (GEMM, softmax, normalization, batch norm, reduction, attention) exercise all converted code paths ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 08:45:22 -06:00
Thomas Ning	75357722b8	CK Tile MX GEMM Packing Improvement (#5323 ) ## Motivation Reduce the scale loading size and also has better utilization of MFMA scale selection. ## Technical Details Add up the packing of mx scales. ## Test Plan Use the existing test cases. ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Sami Remes <samremes@amd.com> Co-authored-by: Enrico Degregori <enrico@streamhpc.com>	2026-03-17 11:57:32 -07:00
Hosang	b5894b3cbe	[CK_TILE] Add LLC-aware FMHA head grouping and head-major scheduling on RDNA (#5018 ) ## Motivation Long-sequence FMHA can become memory-bound when K/V working sets exceed Infinity Cache (LLC), causing repeated DRAM traffic across heads. This PR introduces LLC-aware launch ordering improvements for FMHA forward, and it is currently enabled only on gfx11 and gfx12. The approach is inspired by [`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217), adapted to CK’s kernel/runner structure and layout handling. In this context, `bshd` is the layout used in Flash-Attention, while `bhsd` is the default layout used by the CK Tile FMHA example. ## Technical Details This PR adds two complementary strategies: - For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware head grouping: - Estimate LLC size (env override, KFD sysfs, or arch default). - Compute group size from K/V bytes per head vs LLC target. - Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and related tensors). - For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit launch-order adjustment: - Keep a single kernel launch. - Reinterpret block linearization in `GetTileIndex` to make execution head-major, improving temporal locality of per-head K/V reuse. Additional integration updates: - Propagate `num_head_q_total` and `head_start` through FMHA args/kargs. - Use global head indexing for dropout RNG stream mapping so grouped launches keep deterministic/consistent dropout behavior. - Keep fallback behavior unchanged when grouping is not beneficial or disabled. ## Test Plan - `test_ck_tile_fmha` - `tile_example_fmha_fwd` ## Test Result - `test_ck_tile_fmha`: all tests passed. - `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201, and all of them show higher performance compared to the baseline. The improvement is consistent, and performance is well maintained even at long sequence lengths. ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} - TFLOPs by sequence length target: gfx1100 layout: bhsd SeqLen \| Before \| After \| Speedup -- \| -- \| -- \| -- 1024 \| 56.27 \| 61.48 \| 1.09x 4096 \| 67.10 \| 72.27 \| 1.08x 8192 \| 65.99 \| 71.64 \| 1.09x 12288 \| 61.60 \| 76.61 \| 1.24x 16384 \| 58.99 \| 75.74 \| 1.28x 20480 \| 57.32 \| 74.42 \| 1.30x 24576 \| 56.89 \| 74.25 \| 1.31x 27280 \| 18.93 \| 24.48 \| 1.29x - TFLOPs by sequence length target: gfx1201 layout: bshd SeqLen \| Before \| After \| Speedup -- \| -- \| -- \| -- 1024 \| 66.79 \| 65.90 \| 0.99x 4096 \| 85.90 \| 86.80 \| 1.01x 8192 \| 77.06 \| 90.29 \| 1.17x 12288 \| 58.36 \| 88.98 \| 1.52x 16384 \| 52.12 \| 88.88 \| 1.71x 20480 \| 48.11 \| 88.42 \| 1.84x 24576 \| 47.12 \| 89.07 \| 1.89x 27280 \| 49.05 \| 50.31 \| 1.03x ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-16 21:18:34 +00:00
Bartłomiej Kocot	1e1f3647f7	[CK][CK Tile] Grouped Convolution backward weight profiler flush cache (#5454 ) ## Motivation Flush cache to get more stable results during profiling old ck and ck tile. ## Technical Details Flush cache before each kernel call and one more first run. ## Test Plan test_grouped_conv_bwd_weight_tile ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-966 --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-16 17:46:21 +00:00
lalala-sh	4f091cacd0	[CK] fix moe memset size which is bigger than alloc (#5225 ) ## Motivation Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that crashes split-K MOE GEMM with "HIP runtime error: invalid argument". When KBatch > 1, the invoker zeroes the output buffer using arg.M * arg.N as the byte count. However, arg.M is the padded sorted-token-id length from MOE routing, which can be much larger than the actual output allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write beyond the buffer, and the silently-swallowed HIP error propagates to the subsequent kernel launch via hipGetLastError(). This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset matches the actual output size. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-16 17:30:07 +08:00
Enrico Degregori	a9a8a57786	[CK Tile] Eight Waves pipeline GEMM (#4964 ) ## Motivation Eight waves pipeline was added for ABQuant. The goal of this PR is to enable it also for GEMM ## Technical Details Summary: - Block: - Create block struct for GEMM using eight warps specific distribution encodings - Use this block struct in ABQuant for encodings - Pipeline: - Create impl pipeline for eight waves which can be used by GEMM and ABQuant as base (and for AQuant and BQuant in the future) - Create eight waves pipeline for GEMM (this can not be easily integrated in the existing async pipeline) - Pipeline policy: - Extract GEMM specific parts in the ABQuant policy to define GEMM policy (then ABQuant use it as base and add Quant specific methods) - Minor: naming was inconsistent between warp/wave, everything is now referred to as eight waves So overall we have: - block struct directly used by GEMM -> ABQuant derived struct to implement operator - Impl base pipeline with general implementation -> GEMM and ABQuant pipelines use it to avoid code duplication but still define their own pipelines - pipeline policy struct directly used by GEMM -> ABQuant derived policy struct for Quant specific parts ## Test Plan Added new tests for GEMM pipeline: `test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950 supports it). Note: K padding test is disabled for this pipeline because it's not implemented yet ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-16 09:30:54 +01:00
Bartłomiej Kocot	6373f5df7b	[CK][CK Tile] Grouped Convolution Backward Weight set of fixes (#5387 ) ## Motivation Grouped Convolution Backward Weight split k fixes for CK tile kernels ## Technical Details - get k batch from kargs to get deduced k batch - multiply zeroing size by data type size - disable v6 (producing a incorrect results) ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result Pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Ville Pietilä <>	2026-03-13 10:18:19 -06:00
Yi DING	f6bfcad437	[CK_TILE] FMHA BWD Use Persistent Kernels in Deterministic Mode (#5174 ) ## Motivation This PR enables a persistent-kernel execution path for FMHA backward (dQ/dK/dV) in deterministic mode, adjusting how dQ accumulation is split, stored, and converted back to final gradients. ## Technical Details - Introduces a persistent-kernel grid mapping in deterministic mode and updates split-count calculation accordingly. - Extends kernel kargs to carry batch-related info needed for persistent scheduling and dQ conversion. - Refactors dQ store conditions and adds mask-type traits/utilities and runner logging updates. ## Test Plan - Jenkins [base](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/10/pipeline) - Jenkins [AITER](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/12/pipeline) - Jenkins [FMHA](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/11/pipeline) - local FA tests ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-13 14:13:32 +08:00
Ville Pietilä	5d2cbd1117	[CK_TILE, CK_BUILDER] Add two-stage bwd weight kernels to CK Tile profiler (#5237 ) ## Motivation PR #4797 added CK Tile bwd weight kernels to the CK Profiler. The two-stage kernels were not supported in the initial PR. This PR adds the the missing bwd weight two-stage kernels to the CK Profiler. ## Technical Details Extended the CK Tile conv builder factory to build also the elementwise ops required for the two-stage kernels. Extended the CK Builder for CK Tile instance to accept the two-stage flag as part of the algorithm configuration. ## Test Plan Added units tests for CK Builder that verify the two-stage kernel construction. ## Test Result If CI passes, the added unit tests are passing. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Ville Pietilä <>	2026-03-12 19:20:15 -06:00
dependabot[bot]	3fe871bb0c	Bump tornado from 6.5.4 to 6.5.5 in /projects/composablekernel/docs/sphinx (#5376 ) Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.5.4 to 6.5.5. <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst">tornado's changelog</a>.</em></p> <blockquote> <h1>Release notes</h1> <p>.. toctree:: :maxdepth: 2</p> <p>releases/v6.5.5 releases/v6.5.4 releases/v6.5.3 releases/v6.5.2 releases/v6.5.1 releases/v6.5.0 releases/v6.4.2 releases/v6.4.1 releases/v6.4.0 releases/v6.3.3 releases/v6.3.2 releases/v6.3.1 releases/v6.3.0 releases/v6.2.0 releases/v6.1.0 releases/v6.0.4 releases/v6.0.3 releases/v6.0.2 releases/v6.0.1 releases/v6.0.0 releases/v5.1.1 releases/v5.1.0 releases/v5.0.2 releases/v5.0.1 releases/v5.0.0 releases/v4.5.3 releases/v4.5.2 releases/v4.5.1 releases/v4.5.0 releases/v4.4.3 releases/v4.4.2 releases/v4.4.1 releases/v4.4.0 releases/v4.3.0 releases/v4.2.1 releases/v4.2.0 releases/v4.1.0 releases/v4.0.2 releases/v4.0.1 releases/v4.0.0 releases/v3.2.2 releases/v3.2.1 releases/v3.2.0 releases/v3.1.1</p> <!-- raw HTML omitted --> </blockquote> <p>... (truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="`7d6465056c`"><code>7d64650</code></a> Merge pull request <a href="https://redirect.github.com/tornadoweb/tornado/issues/3586">#3586</a> from bdarnell/update-cibw</li> <li><a href="`d05d59b808`"><code>d05d59b</code></a> build: Bump cibuildwheel to 3.4.0</li> <li><a href="`c2f46732b0`"><code>c2f4673</code></a> Merge pull request <a href="https://redirect.github.com/tornadoweb/tornado/issues/3585">#3585</a> from bdarnell/release-655</li> <li><a href="`e5f1aa4b6f`"><code>e5f1aa4</code></a> Release notes and version bump for v6.5.5</li> <li><a href="`78a046f99f`"><code>78a046f</code></a> httputil: Add CRLF to _FORBIDDEN_HEADER_CHARS_RE</li> <li><a href="`24a2d96ea1`"><code>24a2d96</code></a> web: Validate characters in all cookie attributes.</li> <li><a href="`119a195e29`"><code>119a195</code></a> httputil: Add limits on multipart form data parsing</li> <li>See full diff in <a href="https://github.com/tornadoweb/tornado/compare/v6.5.4...v6.5.5">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=tornado&package-manager=pip&previous-version=6.5.4&new-version=6.5.5)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/ROCm/rocm-libraries/network/alerts). </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-12 14:22:13 -04:00
Adam Osewski	7a12ecc762	[CK TILE] Skip work if any of Grouped GEMM groups M/N/K are zero. (#5050 ) ## Motivation It's common in MoE workloads that some experts receive zero tokens, which would result in some of the dimensions equal to zero. Currently we handle such case only for non-persistent kernels where we have all GEMMs information beforehand on host - we validate this during creation of kernel arguments. However for the "dynamic" input path (persistent kernel) this information is not available before kernel launch. Thus we have to validate this during kernel execution. The goal is to add this validation. ## Technical Details Skip work if any of Grouped GEMM groups M/N/K are zero for persistent kernel path. ## Test Plan Add unit-tests which cover "dynamic" inputs with zero dims for persistent kernel execution path. ## Test Result All tests pass. ## Submission Checklist - [ x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 13:28:24 +00:00
Michal Kulikowski	3c2c294ffb	[CK][Test] Moving device_op creation before data initialization. Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>	2026-03-12 09:47:56 +01:00
Michal Kulikowski	29c4f868ef	[CK][Examples] Adding parameters for a couple of CK examples: -gemm_add_add_mean_meansquare_xdl_fp16 -gemm_dl_quantization_int8 -gemm_xdl_bias_relu_quantization_int8 -gemm_xdl_quantization_int8 Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>	2026-03-12 09:47:41 +01:00
chris-tsiaousis-hpc	9c1c4c9168	Changed the include order of the new WMMA/MFMA unification framework (#5241 ) Those changes are to fix the include order and make header files independent of one another. Also the `remod.py` sript has run and changed the `grouped_convolution.hpp` and `core.hpp` files. ## Motivation Some headers appear to depend on include order. For example, when moving `#include "wmma/wmma.hpp"` in [amdgcn_mma.hpp](https://github.com/ROCm/rocm-libraries/blob/develop/projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp) later in the include list, it is causing compilation errors. Also the pre-commit script `remod.py` is shuffling includes to be in alphabetical order and is causing compilation issues. Expected behaviour: Headers should be independent of one another: no header should require another to be included first. Each header should compile correctly on its own. ## Test Plan The CI (that runs `remod.py`) should compile. ## Test Result Existing CI should compile and be green. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-03-12 09:26:58 +01:00
Aviral Goel	a124535133	ck_tile: add gtest unit tests for MX flatmm (gfx950) (#5082 ) ## Summary - Add correctness unit tests for the MX-format flatmm kernel (`example/ck_tile/18_flatmm/mxgemm`) under `test/ck_tile/flatmm/` - Tests cover all five dtype combinations: FP4×FP4, FP8×FP8, FP6×FP6, FP8×FP4, FP4×FP8 - Tests cover all four kernel dispatch paths (the `has_hot_loop` × `tail_num` product): - `has_hot_loop=false, tail=ODD` (K=256, num_loop=1) - `has_hot_loop=false, tail=EVEN` (K=512, num_loop=2) - `has_hot_loop=true, tail=ODD` (K=768, num_loop=3) - `has_hot_loop=true, tail=EVEN` (K=1024, num_loop=4) - Remove unsupported `-split_k` CLI option from `tile_example_mx_flatmm`; the pre-shuffled B layout is incompatible with K-splitting and the option silently produced wrong results ## Changes New files (`test/ck_tile/flatmm/`): - `CMakeLists.txt` — builds 40 kernel instances as a shared OBJECT library, links into 5 per-dtype test executables; forwards `-DCK_TILE_USE_OCP_FP8` when `CK_USE_OCP_FP8` is ON - `test_mx_flatmm_base.hpp` — base test fixture with `run_test_with_validation(M, N, K, kbatch=1)` - `test_mx_flatmm_fixtures.hpp` — concrete `TestMXFlatmm` typed test class and type aliases - `test_mx_flatmm_fp{4fp4,8fp8,6fp6,8fp4,4fp8}.cpp` — per-dtype `TYPED_TEST_SUITE` files Modified files: - `example/ck_tile/18_flatmm/mxgemm/mx_flatmm_arch_traits.hpp` — moved `preShuffleWeight` here (was in `mx_flatmm.cpp`) so it is includeable by both the example and the tests - `example/ck_tile/18_flatmm/mxgemm/mx_flatmm.cpp` / `run_mx_flatmm.inc` — removed `-split_k` CLI arg, hardcoded `k_batch=1`, fixed `k_split` formula, updated call sites after `preShuffleWeight` move - `test/ck_tile/CMakeLists.txt` — added `add_subdirectory(flatmm)` --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-03-11 15:46:58 -07:00
Bartłomiej Kocot	86ac3213ed	[CK][CK Tile] Improvements for grouped conv fwd tile profiling (#5114 ) ## Motivation Improve profiling for grouped convolution forward for better comparison between CK and CK Tile ## Technical Details - Include preprocessing time for ck tile - Add flush cache for conv fwd profiler - Switch configs to builder reflect - Add KPerXdl deduce - Add non-grouped ported instances ## Test Plan test_grouped_convnd_fwd_tile ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-786	2026-03-11 23:38:15 +01:00
Emily Martins	ddf49459ba	[CK_TILE] Add the GEMM Memory pipeline to Stream-K tests (#5242 ) ## Motivation We want to extend our Stream-K coverage to include other GEMM pipeline since our current tests only test the CompV3 pipeline. ## Technical Details All Stream-K unit tests currently only tests one pipeline: CompV3. These changes extend the test support to also test the Memory pipeline. Future work will add support for additional GEMM pipelines. The major changes are as follows: - Remove of fp8 and bf8 extended tests for gfx90a: gfx90a does not have native support for fp8 and bf8 and emulate the behavior with fp32 mfma instruction sizes. We've observed extremely long compile times for fp8 and bf8 on gfx90a (exceeding 15 minutes), hence we've opted to disable these tests. - Add the memory pipeline to the Stream-K tile engine tests: Now our smoke tests covers compv3 and memory pipelines. - Add the memory pipeline to the Stream-K extended tests: These changes modify the test kernel types to include the appropriate pipeline. Each pipeline is contained within a separate kernel type to help avoid large increases in build time. ## Test Plan - Ran existing and added tests on all architectures. ## Test Result - All local tests pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-11 15:09:17 -06:00
Christopher Millette	4947624915	[CK_TILE] Optimize ck_tile::sequence to reduce template instantiation depth [2A] (#5028 ) ## Summary ### Rationale `ck_tile::sequence` is the most fundamental metaprogramming type in ck_tile — it underpins tensor dimensions, strides, loop bounds, and index calculations. Six of its metafunctions use recursive template instantiation, producing O(N) to O(N²) intermediate types that the compiler must process. When these are used inside deeply nested GEMM pipelines with large dimension counts, the cumulative instantiation overhead becomes a significant contributor to frontend compile time. Measurements on `test_gemm_pipeline_compv6` show 84,288 `InstantiateFunction` calls in the frontend alone. Reducing template instantiation depth in these core utilities has a multiplicative effect because they are called from hundreds of sites. ### What changed \| Metafunction \| Before \| After \| \|---\|---\|---\| \| `sequence::modify` \| O(N) recursive split/merge \| O(1) pack expansion \| \| `sequence_gen` \| O(log N) recursive binary split \| O(1) via `__make_integer_seq` \| \| `uniform_sequence_gen` \| Delegates to `sequence_gen` \| O(1) via `__make_integer_seq` \| \| `sequence_reverse_inclusive_scan` \| O(N) recursive \| O(1) constexpr for-loop + pack expansion \| \| `sequence_inclusive_scan` \| Computed via reverse + flip \| O(1) constexpr for-loop (unified impl) \| \| `sequence_exclusive_scan` \| O(N) recursive merge chain \| O(1) constexpr for-loop + pack expansion \| \| `sequence_map_inverse` \| O(N²) recursive modify calls \| O(1) constexpr for-loop + pack expansion \| Supporting changes: - Portable `__type_pack_element` fallback with `__has_builtin` guard (hipRTC-safe, no `<tuple>` dependency) - Renamed reserved `__integer_sequence` to `integer_sequence_wrapper` - Adopted `static_array` from develop (PR #4355) for constexpr computation - Unified forward and reverse inclusive scan into a single `sequence_inclusive_scan_impl` with `bool Reverse` template parameter - Added `sequence_inclusive_scan` struct (new public API for forward scan direction) - Replaced recursive `sequence_exclusive_scan` (3 template specializations) with `sequence_exclusive_scan_impl` using the same constexpr for-loop pattern as inclusive scan - Rewired `exclusive_scan_sequence` and `prefix_sum_sequence` to use new impl - Added `CK_TILE_HOST_DEVICE` to `exclusive_scan_sequence` and `prefix_sum_sequence` to match sibling scan function annotations ### Technical debt and housekeeping - Unified all `namespace impl` to `namespace detail` across sequence.hpp for consistency - Removed dead comment block (orphaned `integer_sequence` alternative) - Added defensive `static_assert(sizeof...(Is) > 0)` in `sequence_map_inverse::build_inverse` - Converted all multi-line Doxygen blocks from `///` to `/** /` per style guide - Corrected `constexpr static` to `static constexpr` keyword ordering in `static_array` - Added blank line between `#pragma once` and first `#include` in `static_array.hpp` - Trimmed redundant 4-line comment on `sequence_gen_helper` to a one-liner - Moved `sequence_gen` Doxygen comment below `namespace detail` block so it directly precedes the struct it documents - Added Doxygen `@brief`/`@tparam`/`@pre` documentation for `sequence_gen` and `sequence_map_inverse` public APIs - Added `@brief` documentation to `static_array` explaining relationship to `ck_tile::array` - Added scope comment at `namespace detail` openings Note:* `private:`/`public:` access modifier indentation is enforced at 4 spaces by `.clang-format`. The style guide calls for left-alignment, but the formatter overrides this. Requires a `.clang-format` config change to resolve — not addressable in code. ### `static_array` hardening (from develop's PR #4355) - Added zero-length array guard (`T elems[N > 0 ? N : 1]`) - Added `CK_TILE_HOST_DEVICE` annotations to `operator[]` and `size()` - Added `#include "ck_tile/core/config.hpp"` (IWYU for `CK_TILE_HOST_DEVICE`) ### Value Combined with the `static_ford` changes, measured impact on `test_gemm_pipeline_compv6`: - Frontend: -28.9% (InstantiateFunction: 84,288 → 69,439) - Backend: -13.1% (CodeGen Functions: 3,170 → 2,203) - Wall-clock: -16.3% (611.6s → 512.2s) ### Files changed (4) - `sequence.hpp`: Metafunction optimizations, namespace unification, documentation, style fixes - `static_array.hpp`: Zero-length guard, `CK_TILE_HOST_DEVICE`, documentation, style fixes - `test_sequence.cpp`: 50 unit tests with runtime `EXPECT_EQ` assertions (new file) - `CMakeLists.txt`: Register new test target ## Test plan - [x] 50 runtime unit tests covering all optimized and pre-existing sequence APIs - [x] Edge cases: empty sequences, single-element, larger sizes (N=8), negative values, non-trivial init values - [x] Both functor signatures tested (`operator()(index_t)` and `operator()(number<I>)`) - [x] Both scan reducers (`plus`, `multiplies`) with forward, reverse, inclusive, and exclusive directions - [x] Exclusive scan: sum, product, single, empty, non-zero init - [x] Prefix sum: N+1 output verification, single, empty - [x] Permutation round-trip verification for `sequence_map_inverse` - [x] Full sequence public API coverage: modify, gen, uniform_gen, scans (inclusive, exclusive, prefix sum), map_inverse, make_index_sequence, size/sum/product, push/pop, reverse, extract, merge, arithmetic operators, equality, transform - [x] Portable `__type_pack_element` fallback tested implicitly (same `at_index_t` interface) 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2026-03-11 14:24:54 -06:00
JP-Fernando	2c850cd693	[CK] Unify the grouped convolution gridwise Run() functions (#4421 ) ## Motivation There are currently three different grouped convolution related Run() function overloads that exist in `gridwise_gemm_wmma_cshuffle_v3.hpp`. These are used for the different types of grouped convolution: Forward, Backward weights, and Backward data. The functions are very similar and should be unified to a single `Run()` function for all types of grouped convolution. ## Technical Details The three old `Run<>()` functions were replaced with a single unified function. The new `Run<>()` function is run from device implementations: - DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 - DeviceGroupedConvBwdDataMultipleD_Wmma_CShuffleV3 - DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 - DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3 - DeviceGroupedConvBwdWeight_Wmma_CShuffleV3 The DeviceGroupedConvFwdMultipleD_Wmma_CShuffle_V3_Large_Tensor implementation uses a different `Run<>()` overload and was therefore not modified. ## Test Plan Run the following grouped convolution tests on `gfx1201`, as this architecture is WMMA-capable: - `test_grouped_convnd_fwd` - `test_grouped_convnd_bwd_weight` - `test_grouped_convnd_bwd_data` Compilation and testing were also executed on `gfx1100` to avoid CI problems. ## Test Result First part (unification of `Run<>()` function): All tests successful. Second part (integration of single `Run<>()` function as a direct call): All tests successful. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>	2026-03-11 17:38:55 +01:00
JP-Fernando	5d4107862b	[CK] Add BF16^3 support to grouped conv bwd weight: bilinear and scale (#4591 ) ## Motivation Until now, XDL grouped conv bwd weight for bilinear and scale only supported bf16f32bf16. Therefore, bf16bf16bf16 support should be added. ## Technical Details Instances were added to the relevant files in `library/include/ck/library/tensor_operation_instance/gpu/grouped_conv_bwd_weight/` folder. In addition, `add()` functions were included in new files in `library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/xdl/` and `library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_scale/xdl/` folders. The new .cpp files were also included in the `CMakeFiles.txt` files of both folders. ## Test Plan Execute `grouped_convnd_bwd_weight` tests to check execution on different architectures. The tests for bilinear and scale already include the tuple `std::tuple<ck::half_t, ck::half_t, ck::half_t, ck::Number<3>>`, so in principle, there is nothing to modify in the tests themselves. ## Test Result `gfx1201`: Tests passed. `gfx1100`: Tests passed. `gfx90a`: Tests passed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>	2026-03-11 13:05:44 +01:00
Anton Gorenko	25d9fdfc16	[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368 ) ## Motivation Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline ## Technical Details The microscaling is used when quant scale mode is `BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are fp8/bf8/fp4. Supported features: * only "qr" pipeline is implemented * hdim 128 and 256 (smaller hdim are not possible due to restrictions of "qr" pipeline, but they can be computed using instances with padding) * both 32x32x64 and 16x16x128 scale MFMAs are supported * Q and K scales are applied in hdim, V scales - in seqlen dimension * column-major V only * batch and group mode * bias, Alibi (tested but no instances by default, just like fp8) * masking etc. Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008 ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8 ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-11 09:59:50 +00:00
Thrupti Raj Lakshmana Gowda	791ebfb517	[CK TILE ENGINE] Add grouped_gemm operator to Tile Engine (gfx942/gfx950) (#4996 ) ## Motivation The grouped_gemm CK Tile kernel exists (e.g., `example/17_grouped_gemm/`) but has no Tile Engine wrapper. Grouped GEMM handles multiple independent GEMM problems with varying M/N/K dimensions in a single kernel launch. This PR adds the Tile Engine infrastructure for automated kernel generation, benchmarking, and profiling of grouped GEMM kernels. Jira: AICK-809 ## Technical Details - Created Tile Engine wrapper under `tile_engine/ops/gemm/grouped_gemm/` following the `gemm_universal` template - Files added: `CMakeLists.txt`, `grouped_gemm_common.hpp`, `grouped_gemm_benchmark.hpp`, `grouped_gemm_profiler.hpp`, `grouped_gemm_benchmark.py`, `grouped_gemm_benchmark_single.cpp`, `grouped_gemm_instance_builder.py`, `configs/` - Supported datatypes: fp16, fp8, bf16, bf8 - Supported layouts: rcr, rrr, ccr, crr - Target GPUs: gfx942, gfx950 - CK Tile kernel: `ck_tile::GroupedGemmKernel` from `include/ck_tile/ops/gemm/kernel/grouped_gemm_kernel.hpp` - Instance builder extends `GemmKernelBuilder` base class - Registered in `tile_engine/ops/gemm/CMakeLists.txt` - Updated Jenkinsfile to build and benchmark grouped_gemm targets in CI - Benchmark infrastructure includes JSON output, CSV export, and verification support ## Test Plan - CMake configure succeeds for grouped_gemm targets - Kernel instance builder generates valid kernel headers for all (datatype, layout) combinations - At least one kernel binary compiles and runs per datatype/layout combination - Correctness passes with `--verify 1` on gfx942/gfx950 ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-03-10 18:58:37 -05:00
John Shumway	2b68a9baf5	[CK_BUILDER] Add DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 to CK Builder (#5284 ) Add factory, InstanceTraits, and conv traits support for the WMMA V3 forward convolution kernel, enabling the CK Builder to generate and dispatch this kernel variant used by MIOpen on gfx11/gfx12 GPUs. ## Motivation As reported in issue #4944, MIOpen includes WMMA V3 forward convolution kernels, so this PR adds support for those kernels similarly to other supported kernels. ## Technical Details This follows the same implementation as the other kernels. I added some support for reflection, but I left a few todos since we need to generalize our convolution traits to generalize across WMMA/MFMA and CK/CKTile. ## Test Plan Added faster tests to `ninja smoke-builder` that check the instance-traits logic, and I added longer tests that instantiate kernels, following the existing pattern in other kernals. ## Test Result I tested all code with `ninja check-builder` on a gfx1101 build and ran on gfx1101. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 16:41:51 -07:00
Brock Hargreaves	83bbb96890	[CK] Fix warp tile combination selection in absence of a GPU (#5213 ) ## Motivation The `get_gpu_name_by_id()` function in `gemm_streamk_validation_utils.py` relies on `rocminfo` to detect the GPU architecture at runtime. However, __`rocminfo` fails in CI/build environments__ where: - No physical GPU is present - ROCm tools are not installed - The build is running in a container without GPU access In any of these environments, the problem manifests itself in incorrect kernel validation and will generate template instantiations that do not exist: ``` [composable_kernel] FAILED: test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o [composable_kernel] /__w/TheRock/TheRock/build/core/clr/dist/lib/llvm/bin/clang++ -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_ENABLE_TF32 -DCK_TILE_USE_WMMA=0 -DCK_TIME_KERNEL=1 -DCK_USE_FNUZ_FP8 -DCK_USE_GFX94 -DCK_USE_XDL -DDPP_KERNELS -DGEMM_SINGLE_INSTANCE_HPP=\"/__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_streamk_single_fp16_rcr_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.hpp\" -DGEMM_TEST_PARAMS_HPP=\"/__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/test_params.hpp\" -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/profiler/include -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/library/include -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include -I/__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/include -I/__w/TheRock/TheRock/build/profiler/rocprofiler-sdk/stage/include -I/__w/TheRock/TheRock/build/profiler/roctracer/stage/include -I/__w/TheRock/TheRock/build/base/half/stage/include -I/__w/TheRock/TheRock/build/third-party/sysdeps/linux/libdrm/build/stage/lib/rocm_sysdeps/include -isystem /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx942 -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -fno-offload-uniform-block -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -Wno-undefined-func-template -Wno-float-equal --offload-compress -include /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_streamk_single_fp16_rcr_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.hpp -MD -MT test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o -MF test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o.d -o test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o -x hip -c /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/gemm_streamk_tile_engine/test_gemm_streamk_simple.cpp [composable_kernel] In file included from <built-in>:2: [composable_kernel] In file included from /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_streamk_single_fp16_rcr_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.hpp:9: [composable_kernel] In file included from /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm.hpp:23: [composable_kernel] In file included from /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm/block/block_gemm_asmem_bsmem_creg_v1.hpp:7: [composable_kernel] In file included from /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm/block/block_gemm_asmem_bsmem_creg_v1_default_policy.hpp:8: [composable_kernel] /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm/warp/warp_gemm_dispatcher.hpp:185:1: error: implicit instantiation of undefined template 'ck_tile::impl::warp_gemm_dispatcher::Dispatcher<_Float16, _Float16, float, 16, 16, 8, false, false, false, ck_tile::WGAttrNumAccessEnum::Single, ck_tile::WGAttrNumAccessEnum::Single>' ``` ## Technical Details ### Changes Made: #### 1. __gemm_streamk_validation_utils.py__ - Added module-level storage: `_configured_gpu_targets` - Added `set_gpu_targets(targets: List[str])` to configure fallback GPU targets - Added `get_configured_gpu_targets() -> List[str]` to retrieve configured targets - Enhanced `get_gpu_name_by_id()` to: - First try `rocminfo` (existing behavior) - If `rocminfo` fails, fall back to first configured GPU target - Extract base gfx name (e.g., "gfx90a" from "gfx90a:xnack+") - Log debug messages when using fallback #### 2. __gemm_streamk_instance_builder.py__ - Added `--gpu_targets` command-line argument - Automatically calls `set_gpu_targets()` when `--gpu_targets` is provided - Parses semicolon-separated GPU target list from CMake #### 3. __test/ck_tile/gemm_streamk_tile_engine/CMakeLists.txt__ - Modified both `--list_kernels` and `--gen_single` invocations to pass `--gpu_targets "${SUPPORTED_GPU_TARGETS}"` - GPU targets are now automatically wired from CMake to Python scripts ### How It Works: 1. __CMake Configuration__: `SUPPORTED_GPU_TARGETS` is determined from `GPU_TARGETS` or defaults 2. __CMake → Python__: CMake passes targets via `--gpu_targets` argument to Python scripts 3. __Python Configuration__: Scripts call `set_gpu_targets()` to configure the fallback 4. __Fallback Mechanism__: When `rocminfo` fails, `get_gpu_name_by_id()` uses the first configured target 5. __Target Parsing__: Extracts clean gfx name (e.g., "gfx90a" from "gfx90a:xnack+") ## Test Plan Confirm that only the appropriate kernels are selected and that CI passes. ## Test Result 1. Waiting on CI 2. Compilation succeeded locally and the kernel list does not contain the 16x16x8 kernel for gfx942 anymore: ``` (.venv) bhargrea@ctr-cx66-mi300x-02:~/github/TheRock$ cat build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_kernel_list.txt gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_16x16x16\|256x256x32_2x2x1_16x16x16\|compv3_cshuffle_intrawave_atomic_False_False_False_True gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_16x16x16\|256x256x32_2x2x1_16x16x16\|compv3_cshuffle_intrawave_atomic_False_False_False_False gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_16x16x32\|256x256x32_2x2x1_16x16x32\|compv3_cshuffle_intrawave_atomic_False_False_False_True gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_16x16x32\|256x256x32_2x2x1_16x16x32\|compv3_cshuffle_intrawave_atomic_False_False_False_False gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_32x32x8\|256x256x32_2x2x1_32x32x8\|compv3_cshuffle_intrawave_atomic_False_False_False_True gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_32x32x8\|256x256x32_2x2x1_32x32x8\|compv3_cshuffle_intrawave_atomic_False_False_False_False gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_32x32x16\|256x256x32_2x2x1_32x32x16\|compv3_cshuffle_intrawave_atomic_False_False_False_True gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_32x32x16\|256x256x32_2x2x1_32x32x16\|compv3_cshuffle_intrawave_atomic_False_False_False_False ``` ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-10 17:11:56 -06:00
Sami Remes	8b46a8d997	[CK_TILE] MX GEMM non-preshuffled RCR layout (#4594 ) ## Motivation Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled layouts using async pipeline. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2026-03-10 20:12:05 +00:00
Max Podkorytov	90a7253d86	[CK] Precompute SpaceFillingCurve indices to reduce compile time by 31% (#5041 ) ## Summary Optimize `SpaceFillingCurve` in CK to reduce compile time by precomputing all index values into a static constexpr lookup table. ### Problem - `GetIndex<N>` was instantiated separately for every index value (0 to NumAccesses-1) - Each instantiation triggered nested `static_for` loops with O(N²) template depth - This caused 34,000+ template instantiations taking 69 seconds in frontend ### Solution - Add `IndexLookupTable<NumAccesses, nDim>` to store all precomputed indices - Add `compute_single_index()` helper using O(N) `static_for` loops - Add `compute_all_indices()` to build entire table in one constexpr evaluation - `GetIndex<N>` becomes simple array lookup: `return index_table[N]` ### Results (conv2d_fwd_xdl_nhwc_kyxc_nhwk_f16_instance.cpp) \| Metric \| Before \| After \| Improvement \| \|--------\|--------\|-------\|-------------\| \| Total compile time \| 120.4s \| 83.6s \| -31% \| \| Frontend time \| 88.7s \| 52.6s \| -41% \| \| GetIndex instantiations \| 34,176 \| 384 \| -99% \| \| GetIndex time \| 69.0s \| 0.11s \| -99.8% \| \| SpaceFillingCurve time \| 75.7s \| 4.3s \| -94% \| ## Test plan - [x] Builds successfully with `-Werror -Weverything` - [ ] Run existing unit tests - [ ] Verify numerical correctness on sample kernels 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>	2026-03-10 12:40:08 -07:00
Brock Hargreaves	37f3c99ea0	[CK] Streamk tile engine test not setting a reasonable CU_COUNT default when the query fails (#5165 ) ## Motivation The following error was coming up when compiling on Windows when the generate_configs.py file tries to query the GPU for the number of CU's: ``` [composable_kernel configure] -- Generating Stream-K test config files for fp16 [composable_kernel configure] Traceback (most recent call last): [composable_kernel configure] File "E:\TheRock\rocm-libraries\projects\composablekernel\test\ck_tile\gemm_streamk_tile_engine\generate_configs.py", line 277, in <module> [composable_kernel configure] main() [composable_kernel configure] ~~~~^^ [composable_kernel configure] File "E:\TheRock\rocm-libraries\projects\composablekernel\test\ck_tile\gemm_streamk_tile_engine\generate_configs.py", line 271, in main [composable_kernel configure] cu_count, configs_dir_path, tile_sizes, datatype = get_args() [composable_kernel configure] ~~~~~~~~^^ [composable_kernel configure] File "E:\TheRock\rocm-libraries\projects\composablekernel\test\ck_tile\gemm_streamk_tile_engine\generate_configs.py", line 267, in get_args [composable_kernel configure] return (int(args.cu_count), args.configs_dir_path, args.tiles, args.datatype) [composable_kernel configure] ~~~^^^^^^^^^^^^^^^ [composable_kernel configure] ValueError: invalid literal for int() with base 10: 'Exit code 0xc0000135\n' [composable_kernel configure] CMake Error at test/ck_tile/gemm_streamk_tile_engine/generate_configs.cmake:98 (message): [composable_kernel configure] Eror occured during execution of [composable_kernel configure] E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/gemm_streamk_tile_engine/generate_configs.py [composable_kernel configure] Call Stack (most recent call first): [composable_kernel configure] test/ck_tile/gemm_streamk_tile_engine/CMakeLists.txt:301 (generate_test_configs) [composable_kernel configure] [composable_kernel configure] [composable_kernel configure] -- Configuring incomplete, errors occurred! [composable_kernel configure FAILED WITH CODE 1 in 41 seconds] ninja: build stopped: subcommand failed. ``` ## Technical Details There was one major problem in the following code and two changes were made: ``` execute_process( COMMAND ${CPP_EXE_PATH} OUTPUT_STRIP_TRAILING_WHITESPACE ERROR_VARIABLE standard_error RESULT_VARIABLE queried_cu_count ) if (standard_error) message(STATUS "Error information from attempting to query HIP device and properties:\n" "${standard_error}") endif() ``` 1. RESULT_VARIABLE does not capture the IO output of the executable, but rather the exit code. You can see from the error output here that it was trying to cast "Exit code 0xc0000135\n" to an integer. I fixed this by changing RESULT_VARIABLE to OUTPUT_VARIABLE. ``` [composable_kernel configure] ValueError: invalid literal for int() with base 10: 'Exit code 0xc0000135\n' ``` Note that this also gives us the reason that the query failed: Exit code 0xc0000135, which needs to be addressed in a separate issue: "Exit code 0xc0000135, also seen as -1073741515, is a Windows error indicating that an application failed to start because a required Dynamic Link Library (DLL) file or a system component like the .NET Framework is missing or corrupted" It's likely the executable that is created from this code can't find the hip dll, or something similar: ``` set(CPP_FILE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/cu_count.cpp) set(CPP_EXE_PATH ${CMAKE_CURRENT_BINARY_DIR}/cu_count) execute_process( COMMAND ${CMAKE_HIP_COMPILER} -x hip ${CPP_FILE_PATH} -o ${CPP_EXE_PATH} RESULT_VARIABLE compile_result ) ``` 2. For clarity and consistency purposes, I changed the check afterwards to explicitly look for a non-zero exit code. This matches previous checks in the cmake file. I also added improved error checking when the query for the cu count fails. ## Test Plan Ensure it compiles locally and existing CI isn't impacted. ## Test Result Waiting on CI. ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Emily Martins <emily.martins@amd.com>	2026-03-10 10:52:18 -07:00
Hosang	559ad6f0b1	[CK_TILE] Update gfx11 FMHA forward kernel configs (#5088 ) ## Motivation Tune gfx11 FMHA codegen to recover performance for mainly PSSK (padded seqlen_q/k) cases. This tuning is based on heuristic search and improves performance in most tested shapes. Performance should be evaluated on top of [`ROCm/rocm-libraries#5018`](https://github.com/ROCm/rocm-libraries/pull/5018) (required baseline). ## Technical Details - Updated gfx11 codegen heuristic choices for tile size and occupancy. - Updated gfx11 pipeline selection: - Disabled the `npad` (`f,f,f,f`) qr entry because it was consistently slower than the `pssk` (`t,t,f,f`) path, and kept `pssk` enabled so npad cases are dispatched to the faster kernel path.` - Kept gfx12 unchanged: with PSSK support from [`ROCm/rocm-libraries#4957`](https://github.com/ROCm/rocm-libraries/pull/4957), existing gfx12 config is already sufficient. - Tuning rationale: - In some cases, higher `kBlockPerCu` lowers register pressure. - On RDNA, this generally aligns with better performance when `waves_per_eu >= 6`. ## Test Plan - test_ck_tile_fmha - tile_example_fmha_fwd: tested this on gfx1100 and gfx1151 ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result - TFLOPs by sequence length target: `gfx1100` layout: `bhsd` - mode: batch / VGPR usage: 225 vs 214 SeqLen \| Baseline \| Tuned \| Gain -- \| -- \| -- \| -- 1024 \| 74.10 \| 71.97 \| 0.97x 4096 \| 66.26 \| 77.79 \| 1.17x 8192 \| 68.18 \| 75.88 \| 1.11x 12288 \| 68.47 \| 80.44 \| 1.17x 16384 \| 59.54 \| 79.66 \| 1.34x 20480 \| 55.78 \| 77.91 \| 1.40x 24576 \| 55.08 \| 77.47 \| 1.41x 27280 \| 47.45 \| 77.16 \| 1.63x - mode: group / VGPR usage: 256 vs 214 SeqLen \| Baseline \| Tuned \| Gain -- \| -- \| -- \| -- 1024 \| 71.47 \| 70.6 \| 0.99x 4096 \| 64.74 \| 77.06 \| 1.19x 8192 \| 64.68 \| 75.47 \| 1.17x 12288 \| 66.43 \| 79.95 \| 1.20x 16384 \| 56.02 \| 79.73 \| 1.42x 20480 \| 50.21 \| 78.15 \| 1.56x 24576 \| 47.29 \| 77.53 \| 1.64x 27280 \| 46.13 \| 77.04 \| 1.67x - TFLOPs by sequence length target: `gfx1151` layout: `bshd` - mode: batch / VGPR usage: 225 vs 223 Batch \| Baseline \| Tuned \| Gain -- \| -- \| -- \| -- 1024 \| 26.85 \| 29.17 \| 1.09x 4096 \| 24.75 \| 26.01 \| 1.05x 8192 \| 25.24 \| 25.50 \| 1.01x 12288 \| 25.18 \| 25.00 \| 0.99x 16384 \| 24.79 \| 25.91 \| 1.05x 20480 \| 25.56 \| 25.24 \| 0.99x 24576 \| 25.13 \| 26.20 \| 1.04x 27280 \| 10.78 \| 26.35 \| 2.44x - mode: group / VGPR usage: 256 vs 229 Batch \| Baseline \| Tuned \| Gain -- \| -- \| -- \| -- 1024 \| 27.44 \| 26.71 \| 0.97x 4096 \| 21.89 \| 23.09 \| 1.05x 8192 \| 22.85 \| 24.49 \| 1.07x 12288 \| 24.33 \| 24.42 \| 1.00x 16384 \| 20.05 \| 24.98 \| 1.24x 20480 \| 14.70 \| 25.15 \| 1.71x 24576 \| 11.30 \| 26.31 \| 2.33x 27280 \| 10.10 \| 26.32 \| 2.61x ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-10 09:46:41 -07:00
Bartłomiej Kocot	c749ef9da3	[CK][CK Tile] Add grouped conv backward weight tile test and fix tr load in BASE_V1 pipeline (#5115 ) ## Motivation Test grouped conv backward weight from ck tile and fix incorrect values. ## Technical Details - Add test for CI - Add daily tests - Fix transpose load in BASE_V1 pipeline ## Test Plan test_grouped_convnd_backward_weight_tile ## Test Result in progress ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-783	2026-03-10 03:03:04 +00:00
John Shumway	751e29ccb6	[CK_BUILDER] Clean up ConvDescription output formatting (#5085 ) The `ConvDescription::getDetailedDescription()` output had several issues that made it harder to read and potentially misleading: 1. Bug fix: The LDS padding field was incorrectly displaying `dst_scalar_per_vector_k1` instead of the actual `lds_padding` value 2. Noise reduction: Optional parameters that weren't set were printing unhelpful messages like "Struct does not contain optional gemm_padding argument" — these add clutter without providing value to the reader 3. Formatting inconsistencies: Trailing spaces after colons (e.g., `"Warp Gemm parameters: "`) and a stray trailing `×` in tile dimensions 4. Missing thread cluster lengths: The threads per axis are not shown. Changes: - Fixed the LDS padding bug by using `traits_.a_tile_transfer.transfer_params.lds_padding` and `traits_.b_tile_transfer.transfer_params.lds_padding` instead of duplicating `dst_scalar_per_vector_k1` - Simplified optional parameter handling: Changed from printing "Struct does not contain..." messages to simply omitting absent optional values. Also switched from `.value_or()` to direct dereference (``) since we're already inside an `if` check - Cleaned up formatting: Removed trailing spaces after colons and the extra `×` at the end of tile dimension lists - Added missing thread cluster lengths: Added X×Y×Z" display for both A and B tile transfer sections. - Fixed typo: "Do Padd Gemm" → "Do Pad Gemm" - Fixed typo: "scr" → "src" - Fixed typo: "tensros" → "tensors" - `ninja smoke-builder` ✓ - `ninja check-builder` ✓ The test file updates reflect the corrected expected output, which now shows the actual `lds_padding` values (0 or 1), shows thread cluster lenths, and omits the verbose "Struct does not contain..." lines. Note*: This PR follows PR #5083.	2026-03-09 22:36:30 +00:00
kensclin	362bfd72e8	[CK] Fix the issue of the aiter to call eightwarps pipeline. (#5218 ) ## Motivation Fix the failure of the aiter to call eightwarp. Changed Async to the name eightwarps. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan Pass ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-09 11:12:13 -07:00
rocking	baa73e1515	[CK_TILE] Fix FMHA async pipeline LDS sync issue (#4742 ) ## Motivation Fix FMHA forward async pipeline (`block_fmha_pipeline_qr_ks_vs_async.hpp`) sync issue. Some attention test cases intermittently fail due to a race condition where the V tile store to LDS overwrites K tile data that is still being read by other threads during the tail `gemm_0` operation. ## Technical Details In the `BlockFmhaPipelineQRKSVSAsync` pipeline, K and V tiles share the same LDS memory through a rotation schedule (`LdsSeq`). After the tail `gemm_0` (line 458), some fast threads may proceed to store V to LDS (line 617) before slow threads finish reading K data from the same LDS buffer. The fix adds an `s_barrier` synchronization after the tail `gemm_0` when K's last sub-tile and V's first sub-tile use the same LDS buffer (i.e., `LdsSeq[k0_loops - 1] == LdsSeq[k0_loops]`): `if constexpr(LdsSeq.at(number<k0_loops - 1>{}) == LdsSeq.at(number<k0_loops>{})) __builtin_amdgcn_s_barrier();` Why `s_barrier` alone is sufficient (no s_waitcnt lgkmcnt(0) needed): The `gemm_0` MFMA instruction internally waits for its LDS operands (ds_read) to complete before execution Therefore, each thread's ds_read of K data is already complete by the time gemm_0 finishes Only cross-thread synchronization (`s_barrier`) is needed to ensure all threads have finished reading before any thread starts writing V --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-03-09 18:05:36 +00:00
Márton Bidlek	a862155c9e	Proof of concept for removing forward declarations (#5135 ) ## Motivation Currently, we forward declare CK device operation templates in CK-Builder's reflection code: `9b168082b7/experimental/builder/include/ck_tile/builder/reflect/instance_traits_device_grouped_conv_bwd_weight_xdl_cshuffle.hpp (L13-L57)` This is mainly required to break a circular dependency in reflection. The architecture of that is as follows: MyDeviceOp implements GetInstanceString(). This is typically defined directly in the class definition (no forward declaration). GetInstanceString() calls instance_string<MyDeviceOp>() instance_string<MyDeviceOp>() calls InstanceTraits<MyDeviceOp>::instance_string() InstanceTraits has a specialization for MyDeviceOp which implements instance_string() So order for GetInstanceString() to work properly, InstanceTraits must already be defined. And for InstanceTraits to be defined, the device op needs to be defined. In order to do that, we are currently using aforementioned forward declaration. ## Technical Details C++'s lazy template evaluation is used by calling into an as-of-yet undefined function static member function of `InstanceTraits<MyDeviceOp>` in `GetInstanceString()`, and then specializing `InstanceTraits` only _after that_. The caveat here is that both the device op itself as well as the instance traits specialization must be in scope, otherwise there would be an undefined function error. In practise, we can solve that either by placing the instance traits directly into the file that defines `MyDeviceOp`, or possibly by using a `.inc` file to keep the concerns separated. ## Test Plan The results were verified by running the existing regression tests for CK Builder ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Márton Bidlek <marton.bidlek@streamhpc.com>	2026-03-09 09:34:18 -07:00
Christopher Millette	8765b5f417	[CK_TILE] Fix MMA layout test to match amdgcn_mma OpFamily parameter (#5222 ) ## Summary - PR #4837 added `MmaOpFamily OpFamily_` as a new template parameter to `amdgcn_mma` and `MmaDefaultSelector`, but the MMA layout test (PR #4495) was not updated to include it - Add the missing `OpFamily_` parameter to all three `RegisterMapTraits` partial specializations (gfx9, gfx11, gfx12) and all `MmaDefaultSelector` usages - Fixes build failure: `template argument for non-type template parameter must be an expression` ## Test plan - [x] Verified test compiles cleanly with ROCm 7.1.1 clang++ targeting gfx90a - [x] `test_amdgcn_mma_layout` gfx90a (MFMA): PASSED - [x] `test_amdgcn_mma_layout` gfx1201 (WMMA): SKIPPED (no device) - [x] `test_amdgcn_mma_layout` gfx1100 (WMMA): SKIPPED (no device) - [x] CI validation on all GPU targets 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 22:17:56 -07:00

1 2 3 4 5 ...

3993 Commits