composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Author	SHA1	Message	Date
Kiefer van Teutem	2089713f94	[rocm-libraries] ROCm/rocm-libraries#8227 (commit 75c30d5) =?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?= =?UTF-8?q?Remove=20unification=20Flag=20structs=20in=20favor=20of=20new?= =?UTF-8?q?=20WarpGemmParams=20(#8227)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Recently, the way flags are sent down to the intrinsics was changed in CK Tile. At the point where the WarpGemm is invoked, an arbitrary number of template parameters can be passed, and these are passed down all the way to the lowest level intrinsics wrappers. Here `WarpGemmParamsParser<>` is used to extract flags for the intrinsics. In this MR we adapt the the unification framework (amdgcn_mma struct and MmaPipelines) to work in the same way. By doing this, there is no longer a point in our custom intrinsic Flag structs, so these are removed. Unrelated but I also tried removing the MmaPipeline flags because they arn't used for anything except CTranspose, which is already available. This also make test_amdgcn_mma_pipeline completely redundant so removed that as well.	2026-06-26 12:00:58 +00:00
Kiefer van Teutem	137f2a9a10	[rocm-libraries] ROCm/rocm-libraries#7407 (commit 0b79e05) [CK TILE] Initial integration of MFMA / WMMA unification framework into CK Tile (#7407) (locked behind flag) Note: Everything works but this is still a draft MR because I want to do some more cleanup and maybe do some testing for MX fp6. Also please don't trigger copilot, I will do this once I feel it is clean enough, otherwise I'll get a bunch of comments about stuff I already know. ## Motivation The point of this MR is to finally use our unification MmaPipelines to replace the existing WarpGemms in CK tile and make sure everything works. I focused on gfx908 and gfx950 for now, dense and scale intrinsics, fp16, fp8, and fp4. I managed to get CK tests / examples working for all of these scenarios, so the basic implementation should be correct. I expect some more tweaks will be required to get full support, some of which I already anticipated in the section "New issues". ## Big switch: USE_NEW_UNIFIED_FRAMEWORK When USE_NEW_UNIFIED_FRAMEWORK is 1, we replace all WarpGemms with MmaPipelines from the new unified framework. This means WarpGemmDispatcher will use the UnificationDispatcher instead of the regular Dispatcher. Furthermore, named WarpGemms like WarpGemmMfmaF32F32F32M16N16K4 will also get rerouted to the UnificationDispatcher. The latter is necessary because some pipelines bypass the WarpGemmDispatcher in favor of directly using named WarpGemms. For now the switch is turned on for easier testing, so don't expect the CI to pass. When off, this MR should not affect any of the CK tile tests at all so I would expect the CI to pass. ## Simplification of MmaPipelineBase I found that the structure of MmaPipelineBase was a bit complex and I was able to reduce it a lot. The only thing an Mma Pipeline does (currently) is provide a wrapper around amdgcn structs that allows k iteration and sparse compression. We don't allow M and N composition for now for simplicity and since this is not expected from WarpGemms in CK Tile currently. ## Re-interpretation of tile distribution encodings for packed datatypes Tile distributions for packed types are expected to describe mathematical elements, not datatype elements! This distinction is why the gfx950 fp4 CK_tile tests were not working. Updated the interpretation in amdgcn_mma, tile distribution calculator, and layout test, along with comments. Tested on all architectures. ## getCMakeCompilerTarget() for configuration time target architecture This is a workaround because there are a lot of cases in CK Tile where the host code inspects Device constructions like WarpGemm, and we need to get the version that will be used on the device. This is a big kludge and we need to figure out a better solution. Also this util will always pick the first cmakelists target arch, so there will be issues when compiling for multiple target architectures. Ideally, the host code should not touch the WarpGemms at all, and there would be no issue. This has been a point of friction in CK for a long time. We can discuss this with Chris Millette. ## Tests I was able to verify that the following CK Tile tests and examples work with the new unified framework: tile_tutorial_mfma_16x16x16 (gfx9, fp16, uses transpose) tile_example_gemm_basic (gfx9, fp16) test_ck_tile_mx_gemm_async (gfx950, microscaling fp8 and fp4) Within the tile tutorial I was also able to use WarpGemmMfmaF16F16F32M16N16K32TransposedCDistribution instead of WarpGemmMfmaF16F16F32M16N16K16TransposedCDistribution to verify that basic K iteration also works. A little while ago I also verified that the performance did not change in a measurable way, and the compile did not change much but did see some swings up to 20% each way (faster or slower). We will need some broader and more accurate tests for this going forward. ## Moving forward To confidently be able to replace the existing Dispatcher and WarpGemm framework with our own, we need to make sure that all existing tests and examples work on all platforms. Furthermore, we should pay attention to performance and compile time of all these tests. Performance should definitely not change, as all we're doing is refactoring the support structure around the intrinsics, which should melt away during compilation. ## New issues (I will make new issues with descriptions for these but here is a short list (incomplete): Test RDNA CK Tile pipelines Test Sparse Ck Tile pipeline (does not exist but we can make one) Remove MmaOp flags from unification framework and update it to work with new WarpGemmParamsParser instead. Add Swizzle support and test in CK Tile pipelines. Test Scale + transpose Ck Tile pipelines. Coherent strategy for attrnumaccess for dense, scale, default, packed, wmma, gfx1250, etc in CK tile. It's messy now. Dispatcher should not be determining scale-ness of intrinsics based on MNK sizes. Try adding back the MN composition in MmaPipelines Why is test_amdgcn_wavewise_mma only compiled for CDNA? Investigate NOP and AGPR flags Maybe get rid of WmmaTag in dispatcher. Find a coherent strategy for dealing with host vs device compile passes, and the host sneaking a peak at WarpGemm internals. Related to getCMakeCompilerTarget(). ## TODO before merge Some changes exist just for ease of testing, and will be reverted before merging: - gemm_basic.cpp has a lot of datatypes disabled because otherwise compile time is huge for testing - USE_NEW_UNIFIED_FRAMEWORK is set to 1 for easier testing	2026-06-24 13:35:25 +00:00
Enrico Degregori	2733e75900	[rocm-libraries] ROCm/rocm-libraries#6565 (commit d41715e) [CK Tile] Async support pipeline V3 ## Motivation Optimize pipeline V3 for gfx950 by enabling buffer load to lds (async pipeline) ## Technical Details - Add `Async` bool to `Problem` struct to enable async pipeline in existing one - Add `static_move_ys` to load transpose. This generates offset in assembly instructions saving registers - Add `is_valid` to `async_get_vectorized_elements`. Before hard coded to true. It allows to support padding - Remove unnecessary restrictions to `is_a_load_tr` and `is_b_load_tr` (wider use of lds load transpose on gfx950) - Integrate async support in existing V3 pipeline (avoid pipelines duplication) - Create policy to support both async and default cases. This could be used by any async pipeline (next steps) - Define `wg_attr_num_access` separately for A and B. This allows to optimize ds_read instruction width for cases when one matrix is transposed and the other is not. Before in such cases, `ds_read_b64` was used instead of `ds_read_b128` - Add test for V3 async. Currently only supporting cases with A and B having the same type ## Test Plan New test `test_ck_tile_gemm_pipeline_compv3_async` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-19 06:57:14 +00:00
Sami Remes	a3a12b8945	[rocm-libraries] ROCm/rocm-libraries#5813 (commit 18b43cf) [CK_TILE] Enable full transpose layout support for MX GEMM pipeline (#5813) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Enable full transpose layout support for MX GEMM pipeline (32x32x64 MFMA) ### Summary This PR enables all four matrix layout combinations (Row/Col, Row/Row, Col/Col, Col/Row) for the MX GEMM pipeline with `32x32x64` MFMA warp tiles, using `ds_read_tr` transposed LDS loads on gfx950. Previously, only the canonical `A=RowMajor, B=ColumnMajor` layout was supported. ### Changes Kernel-side transpose support: - `warp_gemm_attribute_mfma.hpp`: Introduce `kSplitFactor` logic in `get_warp_dstr_encoding` to split the K-dimension distribution encoding when `kPerLane` exceeds the `ds_read_tr` subtile minor dimension. This satisfies the `TransposeTileDistributionTraits` suffix validation required by `load_tile_transpose`. The distribution encoding now also receives the `DataType` template parameter to compute the split factor based on packed element size. - `gemm_pipeline_ag_bg_cr_comp_async.hpp`: Uncomment and enable the `InputTileDistributionTraits` logic to properly transform LDS load tile distributions for transposed reads. Add `static_assert`s to catch misconfigurations where a layout requires transpose loads but the warp tile size disables them (e.g. `KWarpTile=128` exceeds `ds_read_tr` limits). - `load_tile_transpose.hpp`: Fix `DataVec` sizing for packed types (`pk_fp4_t`) — divide `vecLoadSize` by `PackedSize` to prevent buffer overflow when each physical element contains multiple logical values. - `warp_gemm_attribute_mfma_impl.hpp`: Set `kDefaultScale` to `0x7F7F7F7F` (unity in e8m0 format) for the unscaled `operator()` overloads of `WarpGemmAttributeMfmaImpl_f32_32x32x64_f8f6f4`, ensuring correct behavior with `mfma_scale_f32_32x32x64_f8f6f4`. - `warp_gemm.hpp` / `warp_gemm_dispatcher.hpp`: Add generic `WarpGemmMfma_f32_32x32x64_f8f6f4<A, B>` alias and dispatcher specialization to support arbitrary MX data type combinations (fp4, fp6, fp8) with the 32x32x64 MFMA, consolidating the existing type-specific aliases. - `gemm_pipeline_ag_bg_cr_comp_async_default_policy.hpp`: Simplify `wg_attr_num_access` determination — `Double` for fp8, `Single` otherwise. Reference implementation fix: - `reference_gemm.hpp`: Fix nibble selection for packed 4-bit types (`pk_fp4_t`, `pk_int4_t`) in `reference_mx_gemm`, `reference_gemm`, and `reference_gemm_abquant`. The previous logic used `k % 2` or `index[K_DIM] & 1` to select which nibble to extract, which assumed K was always the fast (contiguous) memory dimension. This is only true for `A=RowMajor` / `B=ColumnMajor`. For other layouts, the fix computes the flat memory offset via `mDesc.GetOffsetFromMultiIndex(...)` and uses its parity to correctly select the nibble regardless of layout. Test infrastructure: - `test_mx_gemm_config.hpp`: Add `MxGemmConfig32` base and `MXfp4_GemmConfig32` / `MXfp8_GemmConfig32` configs for the 32x32x64 warp tile. - `test_mx_gemm_fp4.cpp` / `test_mx_gemm_fp8.cpp`: Add `Config32` test suites covering all four layout combinations. Restrict `Config16` (16x16x128) to `A=Row, B=Col` only, since `KWarpTile=128` exceeds `ds_read_tr` limits. - `test_mx_gemm_util.hpp`: Fix scale tensor layout — scales are always row-major `[M, K/32]` and column-major `[K/32, N]`, independent of A/B data layout. ### Test plan - [x] `test_ck_tile_mx_gemm_fp4` — 5/5 passed (16x16x128 Row/Col + 32x32x64 all 4 layouts) - [x] `test_ck_tile_mx_gemm_fp8` — 5/5 passed (16x16x128 Row/Col + 32x32x64 all 4 layouts) - [x] `test_ck_tile_mx_gemm_fp6` — 1/1 passed (16x16x128 Row/Col)	2026-06-18 17:05:09 +00:00
SamiAario-AMD	947dcc2606	[rocm-libraries] ROCm/rocm-libraries#5510 (commit 8415c8c) [CK Tile] Add transposed tile load implementation, and tests for load_and_convert_tile (#5510) ## Motivation Mixed precision b/fp16 x fp8 requires a transposed tile load implementation that supports mixed precision using these types. Implement this, use it in `load_and_convert_tile`, and add a unit test for `load_and_convert_tile` which covers this functionality. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 06:42:28 +00:00
Wojciech Laskowski	c2601f38b7	[rocm-libraries] ROCm/rocm-libraries#6569 (commit 393049e) Adding amdgcn_mma specializations for sparse MFMA builtins (#6569) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the fourth of the series of PRs (after https://github.com/ROCm/rocm-libraries/pull/5801, https://github.com/ROCm/rocm-libraries/pull/6014 and https://github.com/ROCm/rocm-libraries/pull/6567) that add all the necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse MFMA intrinsics. ## Technical Details This change adds new specializations for MFMA sparse builtins. In total, we add 27 MFMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Test pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 12:48:29 +00:00
Emily Martins	97ca00e449	[rocm-libraries] ROCm/rocm-libraries#7836 (commit cdd9958) [CK Tile] Stream-K RDNA Support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Currently, CK Tile Stream-K only supports CDNA architectures. This change adds Stream-K support on RDNA3/3.5 and RDNA4 architectures. ## Technical Details Stream-K currently has 3 reduction strategies: 1) atomics, 2) linear, and 3) tree. The linear and tree reductions require inter-workgroup communication to a global flags buffer and a global partials buffer. To ensure cache coherency, we use cache modifiers to skip cache levels that are not visible to all workgroups. On CDNA architectures, scalar load and scalar store instructions are available, which we use to read and write to the flags buffer with appropriate cache skipping modifiers. However, RDNA architectures do not support scalar store instructions, so workgroups must use a buffer store instruction to write to flags. Additionally, cache modifiers differ between CDNA and RDNA; they also differ between RDNA3 and RDNA4. Given this information, the main changes are as follows: - Added RDNA flag signaling: Use buffer store instructions for writing to global flags buffer - Add appropriate cache modifiers for reading and writing to flags and partials: - RDNA3 (gfx11): Use `glc \| dlc` coherence flags - RDNA4 (gfx12): Use `DEVICE` coherence scope - SFINAE-guarded overloads: Added compile-time dispatch for `SignalStorePartialDone()` and `WaitStorePartialDone()` based on target architecture - RDNA alignment requirements: Increased flags buffer alignment from 128B to 256B due to RDNA cache line size A note about the `amd_buffer_coherence_enum`: - Problem: The `amd_buffer_coherence_enum` uses preprocessor conditionals (`#if defined(__gfx12__)`) to define architecture-specific values. Template specializations reference enum values from different architectures (e.g., `glc_dlc` for GFX11). Due to C++ two-phase name lookup, non-dependent names are resolved during template parsing regardless of which architecture is being compiled, causing compilation failures when referenced values do not exist in the active preprocessor branch. - Temporary Solution: Added compatibility enum values to each architecture block. For example, I added `glc_dlc` in the `__gfx12__` block. I will create a ticket to refactor this enum with a design that has better scalability and tries to avoid the use of preprocessor conditionals. ## Test Plan ### Summary gtests were added to test wmma variants of Stream-K. These tests were stressed tested locally on gfx11 and gfx12. ### More details This PR makes the following changes/additions to the Stream-K gtests: - Split tests into MFMA (CDNA) and WMMA (RDNA) variants - Added 16 WMMA kernel types: FP16/BF16/FP8/BF8 × Linear/Tree reduction - WMMA uses 16×16×16 wave tiles for RDNA (this is the only tile size supported on RDNA) - Fixed RDNA WGP mode: multiply multiProcessorCount by 2 for actual CU count - As described in [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/doxygen/html/group___global_defs.html#ggacc0acd7b9bda126c6bb3dfd6e2796d7ca3ac50041beb59111a5c76edf03da0898), when in Workgroup Processor (WGP) mode, the value of `hipDeviceAttributeMultiprocessorCount` is half of CUs, because a single WGP contains two CUs. The default mode on RDNA is WGP mode, so when creating (M, N, K) instances for gtests using the CU count, we need to multiply the CU count by 2 to get the correct value. This is not needed in the kernel host code, because the occupancy ensures that overall `max_active_wgs` is correct. ## Test Result All tests pass locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-08 22:48:10 +00:00
Bartłomiej Kocot	2c363870d9	[rocm-libraries] ROCm/rocm-libraries#6744 (commit 9d056e8) [Ck][CK Tile] Global Load/Store for Large Tensors support (#6744) ## Motivation Create solution to support large tensors in the entire ck tile. ## Technical Details - add possiblity to use global load - int64 indexing ## Test Plan conv fwd tests ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-913	2026-06-06 10:14:17 +00:00
Yung-sheng Tu	e826b2eb7e	[rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f) =?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?= =?UTF-8?q?Add=20MFMA=20specialisations=20for=20`tf32=5Ft`=20(#6768)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation This PR adds two specialisations related to `tf32_t`. ## Technical Details This change treats `tf32_t` as a concrete type rather than an empty `struct`. It also adds two new specialisations for MFMA dense builtins and resolves existing circular include issues. ## Test Plan All the new wrappers were added to the test suite in test_amdgcn_mma_layout.inc. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-05 12:27:41 +00:00
Illia Silin	aef7b42883	[rocm-libraries] ROCm/rocm-libraries#7816 (commit f6324af) [CK] Fix latest build issues with staging compiler. ## Motivation Fixing new warnings with staging compiler. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-04 17:41:09 +00:00
John Afaganis	96c39b331e	[rocm-libraries] ROCm/rocm-libraries#7829 (commit 13af7da) [ck] Enforce ASCII-only C/C++ sources for hipRTC compatibility (#7829) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary CK source files must be compilable via hipRTC (HIP runtime compilation), whose preprocessor does not accept non-ASCII bytes anywhere in a translation unit — including in comments. Bytes that are harmless under `hipcc` (em-dashes, smart quotes, multiplication signs, Greek letters, box-drawing glyphs, etc.) cause hipRTC to fail at preprocessing time. These regularly leak in via LLM-assisted authoring or copy/paste from formatted documents and silently break hipRTC paths that are not exercised by the default `hipcc`-based build matrix. This PR (a) cleans every existing violation (53 files) and (b) adds a pre-checkin gate so new violations are rejected before merge. ## File extensions covered Both the cleanup scan and the new Jenkins enforcement stage use the same predicate: ``` .h .hpp .cpp .h.in .hpp.in .cpp.in .inc .cl ``` (excluding `/build/` and `/include/rapidjson/`). This is a strict superset of the existing `Clang Format` stage's predicate — `.inc` is added so test-fixture include files are also gated. The local pre-commit hook's `c++/inc` type filter covers the same set. ## Why no enforcement today CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so the existing `pre-commit` workflow doesn't touch CK. The local CK `.pre-commit-config.yaml` only runs for developers who installed hooks. The authoritative gate is therefore the new Jenkins stage* in this PR; the local hook is convenience. ## Commit layout (bisect-friendly) 1. `79798aa6261` — `[ck] Convert reflect/ rendering to ASCII for hipRTC compatibility` Behavior change, isolated. `TreeFormatter` swaps `├─ / └─ / │ ` for `\|- / +- / \| ` (3-col width preserved so alignment is unchanged). `conv_description.hpp` swaps `×` for `x` as the dimension separator. `test_conv_description.cpp` expected strings updated in lockstep so the snapshot test stays green. This is the only commit in the series with observable runtime impact. 2. `738fdb0d81c` — `[ck] Strip non-ASCII bytes from C++ sources for hipRTC compatibility` Mechanical text cleanup across 53 files. Replacements happen in comments or in `std::cout` strings that are not asserted on by any test. None of the 174 `.inc` files in the tree required edits, but they were in the scan's predicate so the enforcement stage's predicate is a superset of what was scanned. Full replacement table in the commit message. 3. `1d7cd8ba235` — `[ck] Enforce ASCII-only C/C++ sources for hipRTC compatibility` - New `projects/composablekernel/script/check_ascii_only.sh` (modeled on `check_copyright_year.sh`). - New entry in `projects/composablekernel/.pre-commit-config.yaml` under the local-hooks block (`types_or: [c++, inc]`). - New `ASCII Only Check` parallel stage in `projects/composablekernel/Jenkinsfile`'s `Static checks` block, mirroring the existing `Clang Format` stage but with `.inc` added to the find predicate. Always-on, no `RUN_CPPCHECK` gate. The tree is buildable at every commit boundary. Commit 1 leaves 50 known violations; commit 2 leaves 0; commit 3 wires the gate. ## Demo Script output on a synthesized violation: ``` $ printf '// em-dash test \xe2\x80\x94 here\n' > /tmp/bad.cpp $ projects/composablekernel/script/check_ascii_only.sh /tmp/bad.cpp ERROR: /tmp/bad.cpp contains non-ASCII bytes: 1:// em-dash test — here Fix: replace with ASCII (em-dash -> --, smart quotes -> ", arrows -> ->, etc.) $ echo $? 1 ``` Full repo scan after the cleanup commits (note the `-name '.inc'` clause): ``` $ cd projects/composablekernel && find . -type f $ -name '.h' -o -name '.hpp' -o -name '.cpp' \ -o -name '.h.in' -o -name '.hpp.in' -o -name '.cpp.in' -o -name '.inc' -o -name '.cl' $ \ -not -path '/build/' -not -path '/include/rapidjson/' -print0 \ \| xargs -0 -P 8 -n 64 script/check_ascii_only.sh $ echo $? 0 ``` ## Test plan - [ ] Jenkins PR build: confirm new `Static checks -> ASCII Only Check` stage runs green over the full predicate (incl. `*.inc`) and existing `Clang Format` stage is unaffected. - [ ] `test_conv_description` passes against the ASCII tree-formatter output (touched in commit 1). - [ ] Local: `pre-commit run ascii-only-checker --all-files` runs cleanly after installing CK pre-commit hooks via `script/install_precommit.sh`. - [ ] Manually inject a non-ASCII byte in any `.cpp/.hpp/.inc` file, push: confirm Jenkins fails the new stage with a clear error. - [ ] Spot-check a representative subset of touched files under hipRTC compilation to confirm no remaining hipRTC-blocking content (optional, since the static byte check is a sufficient condition for hipRTC preprocessor acceptance on this dimension). 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-06-04 15:00:17 +00:00
Copilot	4fcd73a98e	[rocm-libraries] ROCm/rocm-libraries#7974 (commit 9df2c76) composablekernel: remove stray .hpp.bk backup artifacts (#7974) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Four `.hpp.bk` files were accidentally committed to `projects/composablekernel/`, likely as leftovers from a prior merge or conflict resolution. Each is an older snapshot of its `.hpp` counterpart — the canonical `.hpp` files are newer and contain the correct current content. ## Deleted files \| File \| vs. `.hpp` counterpart \| \|---\|---\| \| `ck_tile/core/tensor/tile_window.hpp.bk` \| Older version: uses legacy `bool isL1Cache`/`PrefetchL1` template params; missing `DataCachePrefetchKind`-based prefetch API and `data_cache_prefetch.hpp` include \| \| `ck_tile/core/tensor/load_tile_transpose.hpp.bk` \| Older version: missing `#if defined(__gfx950__)` guard and `Quad` struct (~90 lines) for gfx1250 architecture \| \| `ck_tile/ops/gemm/warp/warp_gemm_dispatcher.hpp.bk` \| Older version: missing `WmmaTag`, `IsScale16` template param, and several newer dispatcher specializations \| \| `ck_tile/ops/gemm_quant/block/block_universal_gemm_as_bs_bquant_cr.hpp.bk` \| Older version: `KPackA`/`KPackB` (since renamed `KPack`); uses `static_ford` (since refactored to nested `static_for`) \| ## Verification - No other `.bk` files exist in `projects/composablekernel/`. - No build scripts, CMake files, includes, or documentation reference these `.bk` files. - No `.hpp` files were modified.	2026-06-04 03:06:43 +00:00
apophis	42c82b093e	[rocm-libraries] ROCm/rocm-libraries#7786 (commit 7842dfd) [CK TILE][Windows] add `msvc::no_unique_address` support for Windows (#7786) ## Motivation While building Flash Attention 2 with CK backend, this warning will spam in every kernel: ``` DEBUG [1/1837] hipcc.exe ... DEBUG In file included from H:\ROCm\flash-attention\build\fmha_fwd_d32_bf16_batch_b64x64x16x32x32x32_r4x1x1_r4x1x1_w16x16x16_w16x16x16_qr_vr_pssk_nlogits_alibi_mask_lse_ndropout_nskip_nqscale_ntrload_nsink_gfx12.cu:6: DEBUG In file included from H:\ROCm\flash-attention\csrc\composable_kernel\example\ck_tile\01_fmha\fmha_fwd.hpp:6: DEBUG In file included from H:\ROCm\flash-attention\csrc\composable_kernel\include\ck_tile/core.hpp:111: DEBUG H:\ROCm\flash-attention\csrc\composable_kernel\include\ck_tile/core/tensor/tile_scatter_gather.hpp:1246:7: warning: unknown attribute 'no_unique_address' ignored [-Wunknown-attributes] DEBUG 1246 \| [[no_unique_address]] std::conditional_t<kUseGlobalLoad_, PageIdxArray, gl_field_empty_t> DEBUG \| ^~~~~~~~~~~~~~~~~ DEBUG H:\ROCm\flash-attention\csrc\composable_kernel\include\ck_tile/core/tensor/tile_scatter_gather.hpp:1254:7: warning: unknown attribute 'no_unique_address' ignored [-Wunknown-attributes] DEBUG 1254 \| [[no_unique_address]] std::conditional_t<kUseGlobalLoad_, index_t, gl_field_empty_t> DEBUG \| ^~~~~~~~~~~~~~~~~ DEBUG 2 warnings generated when compiling for host. ... ``` ## Technical Details `[[no_unique_address]]` is not working on Windows LLVM, should use `[[msvc::no_unique_address]]`. ## Test Plan Build FA2 with CK backend. ## Test Result No warnings, no errors. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-06-04 02:28:12 +00:00
chris-tsiaousis-hpc	db05d61136	[rocm-libraries] ROCm/rocm-libraries#6212 (commit ccee58d) =?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?= =?UTF-8?q?More=20accurate=20tests=20for=20MmaPipelines=20(#6212)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation This PR solves several issues: #### More accurate tests for MmaPipelines The current tests for the MmaPipelines (test_amdgcn_sparse_mma, test_amdgcn_wavewise_mma) use explicit input fragment vectors filled with 1s, and only check the output of a single lane. We should have tests that actually use the MmaPipelines with non-trivial input matrices and verify the complete output. Some other aspects of the current MmaPipelines tests that I noticed and deserve some attention: 1. There is sometimes iteration over K outside of the pipeline, which is then included in WaveTileK or FragK, which is not correct. We should remove it, move K iteration inside of the pipeline, or be more clear about this outer-K loop size and how it propagates downwards. 2. There is very tight coupling between the kernel, gtest code, and test_pipeline helper, requiring a lot of information and functions to be passed back and forth. 3. The test_pipeline helper is doing a bunch of register-related logic on the host (related to point 1) 4. Without this register logic the only thing it does is check the device, call the kernel, and check the output, but with a lot of boilerplate. #### Test helper for detecting target arch at HOST runtime There is a really apparent issue we faced while writing tests: Scenario: 1. Compile a test that supports both gfx950 and gfx1201 for gfx950 2. Run the test on a server that only has gfx1201 GPU Actual: Segmentation fault Expected: The test can correctly detect from HOST runtime that the DEVICE target_id was different and skips the test. Notes: The only way of detecting the COMPILER_TARGET_ID in the existing "arch" framework is launching a kernel and calling `get_compiler_target()` (so, from a DEVICE code). This will create a segmentation fault if the current arch differs from the target arch. To cope with this issue, we propose to export the compiler target(s) (note they can be many) through `projects/composablekernel/test/ck_tile/core/arch/CMakeLists.txt` and define a test helper to deal with such cases. #### Add composition support to Transforms We have a small number of Transforms which act on MmaOp input and output data, before and after the MmaOp call respectively. These are currently implemented to work on an MmaTile level, but in theory they are also supposed to work at a WaveTile level, i.e. after composition of multiple MmaTiles to create larger effective MNK dimensions. Currently the composed MmaTiles look like 2D C-style arrays of the individual MmaTile level register vectors (see WaveWiseMmaPipeline). The transforms should be able to take these and perform the proper transforms to the whole WaveTile at once. This might allow for better performing transformations. Note: This PR handles the SparseTransform case and if we don't end up doing scale as a transformation, there isn't really much left to do. If we end up having only the sparse transform as a non-trivial transform, then we could also consider removing the Transform framework.	2026-06-03 14:35:18 +00:00
Yi DING	01bd52bdb5	[rocm-libraries] ROCm/rocm-libraries#7925 (commit a8f0845) [CK] Fix gfx950 AITER Sync Regressions MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary Fixes three gfx950 regressions in the AITER downstream CI that surfaced after the internal/gfx1250 re-sync (ROCm/rocm-libraries#6978): > Companion aiter PR: ROCm/aiter#3392 — host-side adaptations (`Kernel::BlockSize()` `constexpr` drops, blockscale `KBatch=1` clamp) plus the CK submodule bump used to validate these fixes together. - FlyDSL MoE AOT cache miss — the AITER MoE tests run with `check_aot_cache=True` and fail on any FlyDSL JIT cache miss, but the CI never pre-compiles the FlyDSL MoE kernels, so gfx950 always misses. Pre-compile them at the start of the AITER test stage. - `buffer.load.lds.v4i32` link error — ROCm/rocm-libraries#6978 reintroduced a clang-version guard mapping `llvm.amdgcn.raw.buffer.load.lds` to a `.v4i32`-suffixed name. That name exists in no LLVM (the rsrc operand is a fixed, non-overloaded `<4 x i32>`, so the intrinsic is never type-mangled), so gfx950 4-DWORD direct-to-LDS (e.g. fp4 MoE bpreshuffle) fails to link with `lld: undefined symbol: llvm.amdgcn.raw.buffer.load.lds.v4i32`. Use the canonical plain name unconditionally. - mixed-precision flatmm warp-GEMM call — ROCm/rocm-libraries#6978 generalized the scaled `WarpGemmImpl::operator()` from a fixed `<index_t opselA, index_t opselB>` signature to a variadic `<typename... Params>` one and updated the `mx_flatmm` pipeline to pass the op-selectors as `OpSelA<>`/`OpSelB<>` types, but missed the mixed-precision flatmm pipeline (`F8xMXF4`/`F16xMXF4`), which still passed raw integer op-selectors. These no longer bind to `typename... Params` (`error: no matching member function for call to 'operator()'`), breaking compilation of the fp8/bf16 × fp4 cktile MoE gemm1 instances on gfx950 (aiter `test_moe_2stage`). Wrap the op-selectors in `OpSelA<>`/`OpSelB<>`. ## Changes - `Jenkinsfile`: pre-compile the FlyDSL MoE AOT cache (`python3 aiter/aot/flydsl/moe.py`) before the AITER tests. - `include/ck/utility/amd_buffer_addressing_builtins.hpp` and `include/ck_tile/core/arch/amd_buffer_addressing_builtins.hpp`: drop the `__clang_major__` guard and always use `__asm("llvm.amdgcn.raw.buffer.load.lds")`. The plain name is the canonical one for all sizes including the gfx950 16-byte form, as the upstream LLVM gfx950 tests confirm. - `include/ck_tile/ops/flatmm/pipeline/mixed_prec_flatmm_pipeline_agmem_bgmem_creg_v1.hpp`: wrap the warp-GEMM op-selectors in `OpSelA<>`/`OpSelB<>` at the five call sites, matching the `mx_flatmm` pipeline. ## Test plan Validated via CI.	2026-06-03 02:09:05 +00:00
Illia Silin	5720589311	[rocm-libraries] ROCm/rocm-libraries#7960 (commit ddac5cf) [CK] Upgrade to new gfx1250 compiler and fix build issues (#7960) ## Motivation The docker image we've been using to build for gfx1250 is a few months old, so we need to upgrade. Some of the changes in the latest compiler version require changes in the code. TDM is temporarily disabled due to changes in the lds load/store intrinsics. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-03 01:58:59 +00:00
Johannes Graner	b7c8fb164f	[rocm-libraries] ROCm/rocm-libraries#7937 (commit abe276d) [CK Tile] Add conv Wavelet GEMM pipeline and bwd_weight instances (#7937) ## Motivation CK Tile had no pipeline competitive with old CK's wavelet on the RetinaNet K=36 C=256 3x3 conv bwd_weight class. This adds a wave-specialized "wavelet" GEMM pipeline so CK Tile has a competitive kernel for spatial small-K shapes. ## Technical Details - New wavelet GEMM pipeline (`gemm_pipeline_ag_bg_cr_wavelet.hpp`): workgroup split into math waves (LDS read + MFMA) and load waves (DRAM read + LDS write). - VGPR role-split: `operator()` has two top-level mutually-exclusive `is_math` branches so the allocator overlays both roles onto the same physical VGPRs, cutting arch VGPR ~33-40% and raising occupancy. Correctness depends on identical `block_sync_lds` counts on both arms plus a matching load-wave barrier stub in the epilogue (`cshuffle_epilogue.hpp`). - Kernel dispatch (`grouped_convolution_backward_weight_kernel.hpp`): `kIsWavelet` path, `LaunchBlockSize`, load-wave barrier stub. Uplift: wavelet is the fastest CK Tile pipeline on the RetinaNet K=36 C=256 3x3 family, beating the best non-wavelet CK Tile kernel by 10-27% (googlenet K=320 by 16-23%); the role-split roughly halves the parity gap vs old CK on the 13x13 fp16 shape. ## Test Plan - `ckProfiler grouped_conv_bwd_weight`, NHWGC layout, fp16/bf16, `split_k=all`, CPU verify on RetinaNet K=36 shapes (7x7, 13x13) and a broad 2D sweep. - Correctness: `-v=1` across `split_k` in {-1,1,2,4,8,16,32,64} (barrier-parity / deadlock check). - `test_grouped_convnd_bwd_weight` over the tests `.conf` wavelet instances. ## Test Result - All wavelet instances CPU-verify correct across the split-K sweep; no hangs (dual-arm barrier sequence matches). - Wavelet wins the RetinaNet K=36 C=256 3x3 family (10-27% over best non-wavelet CK Tile) and googlenet K=320 (16-23%); at parity-or-better vs old CK on the majority of spatial shapes. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-02 08:51:17 +00:00
Tianyuan Wu	22a99f97e8	[rocm-libraries] ROCm/rocm-libraries#7677 (commit 308af93) [CK_Tile] Add scale16 Support for F4 WMMA in CK_Tile ## Motivation This PR adds CK Tile support for the scale16 F4 WMMA path on gfx1250 and improves warp GEMM unit test coverage/structure for gfx1250-specific cases. ## Technical Details - Scale16 support in warp GEMM dispatch and WMMA trait plumbing: added IsScale16 plumbing to warp GEMM dispatcher path - Warp GEMM test restructuring for gfx1250: added Warp GEMM gfx1250 coverage to verify all F4 WMMA paths ## Test Plan Run ./test_ck_tile_wg_32x16x128_fp4. ## Test Result ``` ./test_ck_tile_wg_32x16x128_fp4 [----------] Global test environment tear-down [==========] 3 tests from 1 test suite ran. (1751 ms total) [ PASSED ] 3 tests. ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-30 01:28:48 +00:00
Hosang Yoon	e7e8801dc3	[rocm-libraries] ROCm/rocm-libraries#7586 (commit c18f2c7) [CK_TILE] Use gfx11 float buffer atomics in FMHA Bwd ## Motivation FlashAttention CK backward on gfx11 can hit out-of-bounds/tail writes in the dQ accumulator atomic-add path when sequence rows are padded at the tile level but not marked invalid in the DQDKDV main tensor view. With the generic global atomic fallback, an incorrectly-valid tail element can issue an actual pointer-based `atomicAdd`. With the buffer atomic path, the write is issued through a buffer resource with bounds information and follows the same backend already used by gfx9/gfx12. This fixes the gfx11 FMHA BWD failure without changing the gfx11 default for unrelated CK Tile kernels. ## Technical Details This PR enables the existing CK Tile AMD buffer float atomic-add path only for generated FMHA BWD gfx11 translation units. gfx11 normally uses the generic global atomic fallback for floating-point `buffer_view::atomic_add`. That fallback performs the atomic through a raw computed pointer and depends on the software validity predicate to avoid invalid elements. In FMHA BWD dQ accumulation, padded tail rows can reach this path, so using the buffer atomic backend is safer: it uses a buffer resource with base pointer, bounds information, and an element offset, matching the backend already used by gfx9/gfx12. Enabling `CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT` globally for gfx11 is too broad and can break unrelated gfx11 CK builds such as GEMM. Instead, `config.hpp` now preserves an explicitly pre-defined `CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT`, while keeping the existing default disabled for gfx11. ## Test Plan Validated the change with the FlashAttention CK full test suite with backward pass enabled on gfx11. pytest -q -s tests/test_flash_attn_ck.py ## Test Result FlashAttention CK gfx11 test result: 260680 passed, 152076 skipped ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-05-30 00:10:26 +00:00
Andriy Roshchenko	d5c9215064	[rocm-libraries] ROCm/rocm-libraries#7359 (commit dd62f9f) [CK_TILE][GFX1250] Enable MX GEMM FLATMM with ASYNC ## Motivation Enables MX GEMM FLATMM pipeline on gfx1250. The pipeline uses an async load instruction for tensor A, which complements the existing MX GEMM FLATMM pipeline with TDM load. At this time, only FLATMM MX pipelines are enabled on gfx1250. ## Technical Details The existing gfx950 implementation was extended to support gfx1250 architecture. All three MX FP data types are supported across the two ASICs. It should be noted that while the TDM pipeline uses an emulated 32x32x128 warp-tile instruction, the present submission relies on the built-in 16x16x128 instruction, called 4 times per warp. ## Test Plan Existing `test/ck_tile/flatmm` tests were extended to cover new gfx1250 functionality. To help facilitate the testing in development, `example/ck_tile/18_flatmm/script/smoke_test_mx.sh` script was introduced to verify various combinations of supported data types and pipeline versions. ## Test Result The present submission is expected to work on both gfx950 and gfx1250 hardware for all reasonable sizes and all MX FP8/FP6/FP4 data types. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. - [x] Relies on #6978 and should only be merged after the changes are merged to the `develop`.	2026-05-29 17:02:45 +00:00
Illia Silin	c24e528481	[rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) [CK] suppress compiler warnings while building pytorch. (#7760) ## Motivation Recently added compiler flags that are required to suppress false warnings by latest staging compiler are not recognized by older compiler versions and are triggering an avalanche of warnings. Previous attempt to suppress them by using -Wno-unknown-warning-option flag didn't help, because that flag wasn't recognized either and just added more warnings. I've verified that current approach by checking the clang version actually works as intended and makes the warnings go away. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 06:56:58 -07:00
Anton Gorenko	66d6714376	[rocm-libraries] ROCm/rocm-libraries#5388 (commit 45583bd) [CK_TILE][FMHA] Improve precision of mxfp4 FMHA with fp6 for matrix P (#5388) ## Motivation Improve precision of mxfp4 without performance penalties. ## Technical Details Since performance of scale MFMAs is the same when neither A nor B is fp8/bf8, it is possible to use fp6 x fp4 instead of fp4 x fp4 for the second GEMM, while types of Q, K, V stay the same. This allows to improve overall precision significantly because fp6 has 32 non-negative values used for P quantization compared to just 8 values for fp4. It was found that there is a compiler bug with `__builtin_amdgcn_cvt_scalef32_2xpk16_fp6_f32` (described in LCOMPILER-561) but a workaround seems to fix all failing instances. ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-26 06:55:17 -07:00
Yung-sheng Tu	760f9e1d0a	[rocm-libraries] ROCm/rocm-libraries#7104 (commit 0fab8d8) [CK TILE] Unification Work – Add MFMA specialisations for `fp64_t` (#7104) ## Motivation This PR adds two specialisations related to `fp64_t`. ## Technical Details This adds two new specialisations for MFMA dense builtins, and adjusts ABLayout and CLayout to L{K1BM} and L{M1BN}. ## Test Plan All the new wrappers were added to the test suite in test_amdgcn_mma_layout.inc. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-26 10:49:36 +00:00
Michal Kulikowski	8de4cb72fb	[rocm-libraries] direct push (commit 49b73ad) [CK][CK_TILE] POC for Instruction Cache prefetch. Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>	2026-05-25 11:26:26 +02:00
Wojciech Laskowski	3ea9ce7e37	[rocm-libraries] ROCm/rocm-libraries#6567 (commit 753c7a8) [CK Tile] Adding WMMA wrappers for sparse builtins (#6567) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the third of the series of PRs (after https://github.com/ROCm/rocm-libraries/pull/5801 and https://github.com/ROCm/rocm-libraries/pull/6014) that add all the necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse WMMA intrinsics. ## Technical Details This change adds new specializations for WMMA sparse builtins. In total, we add 8 WMMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Test pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-22 13:34:33 +02:00
Illia Silin	e02c566795	[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24) [CK] upgrade CI to rocm7.13 as default compiler (#7612) ## Motivation Upgrade the default docker and compiler version in CI to rocm7.13. In order to pass all the checks I had to also clean up a lot of non-ascii characters in the source code comments and modify a couple of tests that were affected by a new compiler logic. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2026-05-22 02:43:50 +00:00
JiaLuo-CAN	5ff7497fa7	[rocm-libraries] ROCm/rocm-libraries#7537 (commit 07123f4) [CK Tile] Fix Grouped Gemm quant mixed precision (#7537) <Migrate from Internal repo PR> test_ck_tile_grouped_gemm_quant_tensor would fail for mixed FP8/BF8 cases: std::tuple<Row, Col, Row, FP8, F32, BF8, F32, F32, F16, TensorQuant, False, True, False>, std::tuple<Row, Col, Row, BF8, F32, FP8, F32, F32, F16, TensorQuant, False, True, False> GFX1250 would fail with incorrect results, GFX950 would fail when compiling BF8+FP8 and give incorrect results for FP8+BF8. The issue is due to the wrong ComputeDataType selection. The fix is to consider original ADataType and BDataType even when ComputeDataType is not void. For compiling error on gfx950, the bf8, fp8, 16x16x32 warp Gemm is added.	2026-05-21 08:36:23 -07:00
JP-Fernando	e7798e9560	[rocm-libraries] ROCm/rocm-libraries#7112 (commit a6e5eac) Add asynchronous XOR shuffle support to the Async GEMM pipeline and the MX GEMM pipeline (#7112) ## Motivation The goal of this work is to apply XOR shuffle (swizzle) to the current `comp_async` GEMM pipeline and the `gemm_mx` pipeline. XOR swizzling has been helpful to avoid LDS bank conflicts, as data are redistributed across LDS banks, such that simultaneous threads accessing different rows land on different LDS banks. ## Technical Details A similar approach to the work in the existing eight-waves pipeline was followed. Currently, XOR swizzle support is available for FP8 and BF8 types. FP4 support is also available for MX GEMM. Should the types not match, or should the async vector width be of an unsupported size, then the pipeline falls through to the previously existing ('unswizzled') path. ## Test Plan Execute `test_ck_tile_gemm_pipeline_comp_async` for the Async GEMM pipeline. Execute `test_ck_tile_mx_gemm_fp8` and `test_ck_tile_mx_gemm_fp4` for the MX GEMM pipeline. ## Test Result The tests passed successfully in the `Alola` cluster with MI350 hardware. ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-05-21 09:36:41 +02:00
Wojciech Laskowski	275629fe34	[rocm-libraries] ROCm/rocm-libraries#6014 (commit 2f8259d) [CK Tile] Adding MFMA wrappers for dense builtins (#6014) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the second of the series of PRs (after #5801) that add all the necessary MMA builtins as `amdgcn_mma` structs. This PR focuses on dense MFMA intrinsics. ## Technical Details This change adds new specializations for WMMA dense builtins. In total, we add 55 MFMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Test pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-21 09:05:19 +02:00
JH-Leon-KIM-AMD	720ceb6500	[rocm-libraries] ROCm/rocm-libraries#7528 (commit b4cae6f) [CK Tile] Support multi-vector reads in static encoding patterns (#7528) ## Motivation The thread-raked / warp-raked / block-raked static tile distribution patterns in `ck_tile` silently produce wrong results when the contiguous tile dimension is larger than `warp_size * vector_size`, because the encoding has no per-thread iteration dimension along X. Concretely, with `M_Tile=N_Tile=128`, `VectorSize{A,B,C}=1` in `ConvConfigComputeV3`, the grouped convolution backward-weight example reports about 50 percent wrong values, with errors starting exactly at the `X0X1 = 64` boundary. The second pass over the contiguous dim is never performed. This PR extends the encoding so multi-vector reads in the contiguous tile dimension are supported, while keeping every existing call site bit-for-bit identical. ## Technical Details Three files changed. ### 1. `include/ck_tile/core/algorithm/static_encoding_pattern.hpp` Add a per-thread X iteration dimension in all three raked specializations: - `X0 = min(warp_size, XPerTile / X1)` — threads in X dim - `X1 = min(LargestVec, VecSize)` — vector size per access - `X2 = XPerTile / (X0 X1)` — number of X-iters per thread (new) `X2` is gated with `if constexpr (X2 == 1) { old } else { new }` in both `make_2d_static_tile_distribution()` and `make_shuffled_2d_static_tile_distribution()`. The new encoding places `X2` in the middle of the Ys iteration list, which preserves reverse symmetry between the regular `<..., X2, X1>` and shuffled `<X1, X2, ...>` encodings. Patterns updated: `thread_raked`, `warp_raked`, `block_raked`. ### 2. `include/ck_tile/core/tensor/transpose_tile.hpp` Added a parallel `else if constexpr (... && NDimY == 3 && ...)` branch alongside the existing `NDimY == 2` branch. The original branch is byte-for-byte unchanged. Both branches dispatch to the same `transpose_tile2d_impl_in_thread`, whose body has always been NDimY-generic (iterates with `static_for<0, NDimY, 1>` and `number<NDimY>{}`). ### 3. `experimental/grouped_convolution_tile_instances/generate_instances.py` Removed the two now-obsolete skip guards in `parse_bwd_weight_instances` and `parse_bwd_data_instances`: ```python if m_per_block > (warp_size * a_scalar_per_vector) or n_per_block > (warp_size * b_scalar_per_vector): print(f"Skipping instance {instance_id} with multiple warps per continous tile dim since it's not supported yet.") continue ``` Other unrelated skips (V5 / V6 / ASYNC_V4 pipeline gating, irregular-load shapes, scalar-per-vector > tile size) are kept untouched. ### Compatibility Strict. Every existing caller has `X2 == 1` and therefore hits the original encoding path verbatim. No upstream config or pipeline behavior changes. ## Test Plan The grouped convolution example is the natural exerciser since `GroupedConvUniversalPipelineAgBgCrPolicy` selects `thread_raked` for both A and B tiles, and all three conv directions share the same `ConvConfigComputeV3`. For each test below we ran: ``` ./build/bin/tile_example_grouped_conv_bwd_weight [-prec={fp16,bf16}] ./build/bin/tile_example_grouped_conv_fwd [-prec={fp16,bf16}] ./build/bin/tile_example_grouped_conv_bwd_data [-prec={fp16,bf16}] ``` with `ConvConfigComputeV3` tile/vector parameters tweaked to cover both code paths: \| Test \| M / N / K \| VecA/B/C \| A path \| B path \| dtype \| \|------\|-------------\|----------\|------------\|----------------\|-------------\| \| T1 \| 16/64/32 \| 4/8/4 \| old (X2=1) \| old (X2=1) \| fp16 \| \| T2 \| 128/128/64 \| 2/2/2 \| old (X2=1) \| old (X2=1) \| fp16 \| \| T3 \| 256/256/64 \| 1/1/1 \| old (X2=1) \| new (X2=4) \| fp16 \| \| T5 \| 256/256/64 \| 1/1/1 \| old (X2=1) \| new (X2=4) \| fp16 (3 dir)\| \| T4b \| 128/128/128 \| 1/1/1 \| new (X2=2) \| new (X2=2) \| fp16 + bf16 (3 dir) \| A larger T4a (256/256/128) was attempted to stress both A and B with X2>1 on bigger tiles but was blocked by the gfx942 hardware LDS cap (128 KB > 64 KB limit), independent of this PR. For the generator change we ran: ``` python3 generate_instances.py --mode profiler --direction all ``` and verified `Skipping instance ... with multiple warps per continous tile dim` no longer appears (count went from non-zero to 0); other skip categories are unchanged. `clang-format-18` was applied to both modified `.hpp` files (matches the repo's `.clang-format`). ## Test Result - T1 and T2 (compat-strict, every X2 is 1, old code path): `correct`. Confirms existing callers are unaffected. - T3 (X2=4 on B only): `correct`. First true exercise of the new NDimY=3 encoding + transpose branch. - T5 (T3 across `fwd` + `bwd_data` + `bwd_weight`, fp16): all 3 `correct`. - T4b (X2>1 on both A and B, fp16 + bf16, all 3 directions): all 6 runs `correct`. - Generator: 0 `multiple warps per continous tile dim` skips remaining; other skips unchanged. Sample run output (T4b, bf16, bwd_data): ``` shape: tile_gemm_shape_128x128x128x4_1x4x1_16x16x32 pipeline: pipeline_AgBgCrCompV3_128x128x128_256_1x1x1_1x4_1x1x1_..._DoubleSmemBuffer_0 Vector size A: 1, Vector size B: 1, Vector size C: 1 0.934907 ms, 8.34683 TFlops, 34.3178 GB/s Relative error threshold: 0.00390625 Absolute error threshold: 0.25 The CPU verification result is: correct ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-20 17:25:22 +03:00
Kiefer van Teutem	b5f8bef97f	[rocm-libraries] ROCm/rocm-libraries#6088 (commit 6ac353c) [CK Tile][MFMA/WMMA unification] Add support for packed datatypes (tiny types) (#6088) ## Motivation This MR makes all the changes required for the unified architecture to be able to deal with packed datatypes i.e. int4, fp4, fp6, and bf6. The crux is that layout parameters should be interpreted as describing the pure mathematical matrix fragments, while the ext_vectors and tile distribution encodings describe everything in terms of packed datatype units. This matches how packed types are dealt with in ck_tile and should play nicely with the load and store tile ops once we integrate the unified framework into CK tile. The bf6 datatype was added to CK tile in the form of pk_bf6x16_t and pk_bf6x32_t, which did not exist before. The ext_vector implementations of pk_fp6x16_t and pk_bf6x16_t (vec size 1 and 2) were extended to make the subscripting operator work as expected. The layout test was adapted to be compatible with all packed datatypes, and all new intrinsics were added to the test. This MR adds ALL intrinsics across ALL architectures which use packed datatypes, as well as ALL scale intrinsics: mfma_scale_f32_16x16x128_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6, BF6xBF6) mfma_scale_f32_32x32x64_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6, BF6xBF6) wmma_i32_16x16x16_iu4_w32 wmma_i32_16x16x16_iu4_w32_gfx12 wmma_i32_16x16x32_iu4_w32_gfx12 ## Testing All intrinsics were tested on all architectures.	2026-05-20 12:36:13 +00:00
Aviral Goel	458dd0ac4c	[rocm-libraries] ROCm/rocm-libraries#7130 (commit 9e1e065) [CK_TILE] Redesign LDS store API with pre-computed window coordinates (+15% MI355X, +6% MI300X) (#7130) ## Summary - Redesign the LDS store API to separate window creation from memory transfer - Add `MakeDistributedLdsStoreWindow` factory, `LocalStore` (fast path), and `LocalStoreWithCoordRecompute` (slow path) to the pipeline base class - Convert CompV3 as the reference implementation - Document the slow/fast path distinction across core tensor headers ## Motivation `LocalPrefill` hides a performance cliff: when given a bare `tile_window_with_static_lengths`, it silently reconstructs `tile_window_with_static_distribution` on every call — paying significant VALU overhead (~96 for typical configurations) for XOR coordinate computation. The cost is invisible at the call site. The new API makes the cost explicit via three verbs: \| Verb \| Method \| Cost \| When to use \| \|------\|--------\|------\|-------------\| \| Create \| `MakeDistributedLdsStoreWindow(bare, dstr)` \| VALU (once) \| Before hot loop, when VGPR budget allows \| \| Store (fast) \| `LocalStore(precomputed_window, tensor)` \| 0 VALU for coords \| Pre-computed window available \| \| Store (on-the-fly) \| `LocalStoreWithCoordRecompute(bare, tensor)` \| VALU per call \| VGPR budget tight, or one-shot stores \| Both `LocalStore` and `LocalStoreWithCoordRecompute` enforce correct window types via `static_assert`. `LocalPrefill` is retained for backward compatibility (69 call sites across 6 pipeline files). ## Performance ### 86 Shapes, CompV3_2 (128×128 tile), fp16, RCR layout gfx942 (MI300X): 86/86 improved, 0 regressions. Average gain: +6.2% gfx950 (MI355X): 85/86 improved, 1 neutral, 0 regressions. Average gain: ~+15% <img width="2777" height="1178" alt="pr7130_perf_chart" src="https://github.com/user-attachments/assets/b2f5c406-eb20-469d-8da6-dd608c28fbcc" /> \| Shape (MxNxK) \| Source \| gfx942 \| gfx950 \| \|---\|---\|---\|---\| \| 22016x256x4096 \| llama2_7b_fc1 \| +5.3% \| +11.4% \| \| 22016x512x4096 \| llama2_7b_pfill \| +5.9% \| +10.9% \| \| 4096x512x22016 \| llama2_7b_pfill \| +7.6% \| +28.5% \| \| 22016x1024x4096 \| llama2_7b_pfill \| +6.1% \| +10.1% \| \| 4096x1024x22016 \| llama2_7b_pfill \| +7.4% \| +17.2% \| \| 22016x4096x4096 \| llama2_7b_pfill \| +5.2% \| +9.3% \| \| 4096x4096x22016 \| llama2_7b_pfill \| +6.0% \| +9.3% \| \| 4096x4096x4096 \| llama2_7b_pfill \| +5.7% \| +10.6% \| \| 28672x256x4096 \| llama3_8b_fc1 \| +5.4% \| +12.2% \| \| 28672x512x4096 \| llama3_8b_pfill \| +4.9% \| +6.4% \| \| 4096x512x28672 \| llama3_8b_pfill \| +7.4% \| +1.5% \| \| 28672x2048x4096 \| llama3_8b_pfill \| +4.9% \| +8.6% \| \| 4096x2048x28672 \| llama3_8b_pfill \| +6.4% \| +8.4% \| \| 28672x8192x4096 \| llama3_8b_pfill \| +5.4% \| +8.0% \| \| 7168x1024x8192 \| llama70b_pfill \| +6.6% \| +10.8% \| \| 8192x1024x7168 \| llama70b_pfill \| +6.4% \| +11.4% \| \| 7168x4096x8192 \| llama70b_pfill \| +6.2% \| +9.6% \| \| 16384x256x4096 \| bloom_fc1 \| +6.4% \| +20.3% \| \| 16384x512x4096 \| bloom_fc1 \| +5.8% \| +8.5% \| \| 16384x1024x4096 \| bloom_fc1 \| +6.0% \| +10.9% \| \| 16384x2048x4096 \| bloom_fc1 \| +5.3% \| +10.1% \| \| 16384x3072x4096 \| bloom_fc1 \| +5.5% \| +8.8% \| \| 16384x4096x4096 \| bloom_fc1 \| +5.7% \| +8.8% \| \| 4096x256x16384 \| bloom_fc2 \| +7.8% \| +33.6% \| \| 4096x512x16384 \| bloom_fc2 \| +7.5% \| +31.6% \| \| 4096x1024x16384 \| bloom_fc2 \| +7.1% \| +17.1% \| \| 4096x2048x16384 \| bloom_fc2 \| +6.9% \| +11.0% \| \| 4096x3072x16384 \| bloom_fc2 \| +6.8% \| +11.0% \| \| 4096x4096x16384 \| bloom_fc2 \| +6.7% \| +10.3% \| \| 12288x256x4096 \| bloom_inproj \| +6.7% \| +22.0% \| \| 12288x512x4096 \| bloom_inproj \| +6.2% \| +9.8% \| \| 12288x1024x4096 \| bloom_inproj \| +5.9% \| +12.4% \| \| 12288x2048x4096 \| bloom_inproj \| +5.8% \| +10.1% \| \| 12288x3072x4096 \| bloom_inproj \| +5.4% \| +10.1% \| \| 12288x4096x4096 \| bloom_inproj \| +5.7% \| +9.1% \| \| 250880x256x4096 \| bloom_logits \| +2.6% \| +0.5% \| \| 4096x256x4096 \| bloom_outproj \| +7.1% \| +28.4% \| \| 4096x512x4096 \| bloom_outproj \| +6.8% \| +27.4% \| \| 4096x1024x4096 \| bloom_outproj \| +6.5% \| +21.3% \| \| 4096x2048x4096 \| bloom_outproj \| +5.9% \| +13.1% \| \| 4096x3072x4096 \| bloom_outproj \| +5.9% \| +12.0% \| \| 16x1536x7168 \| deepseek \| +7.7% \| +34.7% \| \| 32x1536x7168 \| deepseek \| +7.7% \| +34.9% \| \| 64x1536x7168 \| deepseek \| +7.6% \| +31.3% \| \| 128x1536x7168 \| deepseek \| +7.6% \| +25.8% \| \| 256x1536x7168 \| deepseek \| +7.7% \| +27.9% \| \| 512x1536x7168 \| deepseek \| +7.6% \| +29.1% \| \| 1024x1536x7168 \| deepseek \| +7.3% \| +28.8% \| \| 2048x1536x7168 \| deepseek \| +6.9% \| +20.5% \| \| 4096x1536x7168 \| deepseek \| +6.3% \| +11.0% \| \| 8192x1536x7168 \| deepseek \| +6.2% \| +11.3% \| \| 16384x1536x7168 \| deepseek \| +6.0% \| +9.1% \| \| 20480x1536x7168 \| deepseek \| +4.8% \| +9.3% \| \| 16x3072x1536 \| deepseek \| +6.3% \| +25.1% \| \| 32x3072x1536 \| deepseek \| +6.4% \| +25.3% \| \| 64x3072x1536 \| deepseek \| +6.4% \| +24.8% \| \| 1024x1024x1024 \| square \| +5.5% \| +18.7% \| \| 2048x2048x2048 \| square \| +6.0% \| +19.2% \| \| 3584x3584x3584 \| square \| +5.3% \| +11.2% \| \| 5120x5120x5120 \| square \| +6.1% \| +10.0% \| \| 6144x6144x6144 \| square \| +5.5% \| +9.8% \| \| 8192x8192x8192 \| square \| +6.0% \| +8.2% \| \| 1024x4608x1024 \| midsize \| +4.6% \| +4.6% \| \| 512x18432x512 \| midsize \| +1.9% \| +10.1% \| \| 4096x18432x4096 \| midsize \| +5.8% \| +8.8% \| \| 320x8192x320 \| stablediff \| +4.0% \| +11.3% \| \| 640x2048x640 \| stablediff \| +4.5% \| +14.0% \| \| 320x8192x1280 \| stablediff \| +5.6% \| +20.1% \| \| 1x1280x8192 \| skinny_m1 \| +7.7% \| +35.3% \| \| 1x8192x1024 \| skinny_m1 \| +6.0% \| +20.3% \| \| 1x7168x8192 \| skinny_m1 \| +7.7% \| +36.6% \| \| 1x8192x3584 \| skinny_m1 \| +7.3% \| +27.9% \| \| 1x13312x6656 \| skinny_m1 \| +7.6% \| +30.3% \| \| 1x13312x16384 \| skinny_m1 \| +7.8% \| +4.2% \| \| 1x16384x6656 \| skinny_m1 \| +7.5% \| +28.7% \| \| 1x16384x16384 \| skinny_m1 \| +7.7% \| +2.3% \| \| 16x4096x4096 \| skinny_m16 \| +7.4% \| +31.9% \| \| 16x22016x4096 \| skinny_m16 \| +7.5% \| +26.5% \| \| 16x28672x4096 \| skinny_m16 \| +7.0% \| +15.1% \| \| 16384x1280x8192 \| skinny_m16 \| +5.6% \| +8.7% \| \| 16384x8192x1024 \| skinny_m16 \| +4.5% \| +8.8% \| \| 2048x4096x2048 \| mixed \| +4.7% \| +9.0% \| \| 4096x2048x8192 \| mixed \| +6.8% \| +11.0% \| \| 8192x4096x4096 \| mixed \| +5.2% \| +10.0% \| \| 1x4096x4096 \| mixed \| +7.4% \| +32.4% \| \| 1024x1024x4096 \| mixed \| +7.1% \| +27.4% \| ### ISA Hot Loop Diff (LBB1_32, per K-iteration, gfx942) \| Metric \| Baseline \| Optimized \| Delta \| \|--------\|----------\|-----------\|-------\| \| Total VALU \| 621 \| 500 \| -121 \| \| VGPR / SGPR \| 512 / 96 \| 512 / 96 \| unchanged \| ### Hardware Counters — Instruction Mix (gfx950, rocprofiler-compute) Profiled on MI350X, shape 4096×256×16384 (bloom_fc2). Instruction counts are deterministic hardware counters. \| Metric \| Baseline \| Optimized \| Δ \| \|--------\|----------\|-----------\|---\| \| VALU instructions/kernel \| 4,642,473 \| 987,958 \| −78.7% \| \| INT32 VALU \| 2,592,786 \| 541,129 \| −79.1% \| \| Instructions / wavefront \| 39,178 \| 24,400 \| −37.7% \| \| VGPRs (avg) \| 98 \| 90 \| −8% \| \| MFMA instructions \| 2,059,702 \| 2,059,702 \| 0% \| \| LDS instructions \| 1,564,891 \| 1,564,891 \| 0% \| \| VMEM instructions \| 520,996 \| 520,996 \| 0% \| MFMA as fraction of total instructions: 30.7% → 67.5%. Eliminating ~3.65M redundant INT32 VALU instructions (XOR coordinate recomputation per K-iteration) leaves the scheduler more headroom for MFMA dispatch, directly explaining the benchmark gains.	2026-05-19 20:22:37 -04:00
Enrico Degregori	9565ca21ec	[rocm-libraries] ROCm/rocm-libraries#5552 (commit 369c7a2) [CK Tile] Eight Waves pipeline for MX GEMM (#5552) ## Motivation Integrate Eight Waves pipeline in MX GEMM ## Technical Details - EightWaves pipeline: - Add pipeline, policy and block gemm (internally using existing implementation used by GEMM and ABQuant) - Extend support of EightWaves policy for FP4 (packed types) - Async pipeline: - Fix pipeline with packed scales (requires MRepeat and NRepeat to be contiguous) - block gemm specific for MX GEMM is defined because distribution encodings have changed - CShuffle: - Add new functionality to support MRepeat and NRepeat contiguous (defined by `TilesPacked`) - Examples: - Refactor examples to easily switch different configurations (similar to GEMM universal) - Scales values generated consistently with other microscale implementations in CK Tile - Add configuration for EightWaves pipeline - Tests: - Unify existing FP8 and FP4 tests - Add tests for EightWaves pipeline - Scales values generated consistently with other microscale implementations in CK Tile Note: FP6 support for MX GEMM was added later and the support for the Eight Waves pipeline will be done in following PR ## Test Plan Add new pipeline to tests: `test_ck_tile_mx_gemm_async` for both FP4 and FP8 ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-19 11:53:19 -07:00
Yung-sheng Tu	5169cd14a1	[rocm-libraries] ROCm/rocm-libraries#7543 (commit 2b735ff) Fix for #6207 (#7543) ## Motivation PR #6207 introduces an error. This PR is the fix of it. ## Technical Details Adds a path for GFX1250 in `to_string` ## Test Plan Test has already included. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-19 00:54:46 +00:00
Yung-sheng Tu	3ccb72e761	[rocm-libraries] ROCm/rocm-libraries#6207 (commit cc56378) [CK TILE] Unification Work – Add `print()` Utility to `MmaOpTraits` (#6207) ## Motivation It would be useful to have a `print()` utility inside of unification work's code scope, so that we can print all template params and derived params of `amdgcn_mma` for easier debugging. ## Technical Details Adding helper functions and struct to traits, adding `print_flags()` for each `DefaultCtrlFlags`, `amdgcn_target` and `MmaOpTraits` structs, and adding `print()` for `amdgcn_mma`. Note: the first commit is not* in the scope of this PR. This PR should be merged after https://github.com/ROCm/rocm-libraries/pull/5801 and https://github.com/ROCm/rocm-libraries/pull/5857. ## Test Plan Adding test in layout test. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-18 13:02:38 +02:00
Illia Silin	717f2efef7	[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d) [CK] add composable kernel support on gfx1250 (#6978) ## Motivation Add composable kernel support on gfx1250. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Qun Lin <qlin@amd.com> Co-authored-by: jialuo12_amdeng <jia.luo@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>	2026-05-15 06:46:51 -07:00
Illia Silin	ac18460782	[rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70) [CK] Suppress new staging compiler errors (#7384) ## Motivation This should make new builds with staging compiler pass. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 12:51:08 -07:00
Illia Silin	22b9feb40f	[rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f) [CK] Fix latest batch of staging compiler warnings (#7111) ## Motivation Suppress the new batch of clang lifetimebound and invalidation warnings with the latest staging compiler. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-08 07:14:14 -07:00
Wojciech Laskowski	640bd560ec	[rocm-libraries] ROCm/rocm-libraries#5801 (commit 27f6d15) [CK Tile] Adding WMMA wrappers for dense builtins (#5801) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the first of the series of PRs that add all the necessary MMA builtins as a `amdgcn_mma` structs. ## Technical Details This change adds new specializations for WMMA dense builtins. In total, we have now 9 RDNA4 builtins and 3 RDNA3 builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Test pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Yung-sheng Tu <yung-sheng@streamhpc.com>	2026-04-27 11:57:51 +00:00
Jeff Huang	08097fa515	[rocm-libraries] ROCm/rocm-libraries#6653 (commit 1df887e) [CK_TILE] fix(fmha): support >2GB KV cache in batch prefill via template dispatch (#6653) ## Motivation The CK batch prefill kernel previously failed (silent overflow + page faults) when the KV cache exceeded 2 GB, blocking long-context inference workloads (e.g., 128K+ token contexts with paged KV). Two distinct failure modes were addressed: 1. >4GB SRD overflow (`page_size < kN0`): The SRD `buffer_load_dwordx4` path uses a 32-bit `voffset` register; for small page sizes the rebased SRD spans the full KV pool and the offset wraps past 2 GB, corrupting K/V loads. 2. gfx950 page-table fault (`page_size >= kN0`): On CDNA4 the hardware validates the full SRD `num_records` range against page-table permissions (CDNA3 only checks per-instruction `voffset`). After per-tile SRD rebase, an un-trimmed `num_records` field extends past the live page and faults on freed/protected memory. ## Technical Details Two-mode `tile_scatter_gather` selected by the `kUseGlobalLoad` template parameter: \| Case \| `page_size` \| KV cache size \| Mode \| Load path \| Addressing \| \|---\|---\|---\|---\|---\|---\| \| 1 \| `>= kN0` (large pages) \| any \| SRD (`kUseGlobalLoad=false`) \| `buffer_load_dwordx4` \| 32-bit `voffset`, bounded by per-page rebase \| \| 2 \| `< kN0` (small pages) \| `<= 2 GB` \| SRD (`kUseGlobalLoad=false`) \| `buffer_load_dwordx4` \| 32-bit `voffset`, fits in INT32 byte range \| \| 3 \| `< kN0` (small pages) \| `> 2 GB` \| Global-load (`kUseGlobalLoad=true`) \| `async_load_tile_raw_flat` (K) + `load_tile_flat` (V) \| 64-bit \| Dispatch: the auto-gen API layer (`fmha_batch_prefill.py`) selects the kernel instantiation at launch from `(page_block_size, num_total_pages * batch_stride_k * kElementBytes)`, so the small-page penalty is paid only when correctness requires it. gfx950 SRD `num_records` trimming: in the K and V rebase lambdas of `block_fmha_batch_prefill_pipeline_qr_ks_vs_async`, `set_bottom_tensor_view_buffer_size(page_stride_k/v)` is called after each rebase to constrain `num_records` to the live page. Required for CDNA4 page-table validation; harmless on CDNA3. Pipeline sync for the global-load path: - V uses synchronous `load_tile_flat`; K uses `async_load_tile_raw_flat`. - `v_physical_pages_current` is double-buffered so the V flat load doesn't race against the next iteration's K rebase computation. Arch guards: `global_load_lds` intrinsics are gated to `__gfx94__` / `__gfx950__` (CDNA3+). Other architectures hit a `dependent_false` static_assert with a descriptive message. Device-side assertion convention: SRD setters use `__builtin_assume(cond)` (hint-only) rather than `<cassert>`'s `assert()`. The latter introduces an `__assert_fail` call whose register pressure scatters the K-SRD scalar register window across conditional branches, corrupting `buffer_load_dwordx4` on gfx950. ## Test Plan Tested on both MI308 (gfx942) and MI355 (gfx950) via the aiter wrapper test suite. All coverage lives in `op_tests/test_batch_prefill.py`: - Functional matrix (96 cases) — `test_batch_prefill`: `page_size ∈ {1, 16, 1024}` × `kv_layout ∈ {linear, vectorized}` × `dtype ∈ {bf16, fp8 quant variants}` × `causal` × `soft_cap` × `LSE` × `batch_size ∈ {1, 4}` (parametrized to exercise per-sequence SRD rebase across batch boundaries). - >2 GB coverage — `test_batch_prefill_large_kvcache`: extended to allocate a 5 GB+ KV cache pool and exercise both `kUseGlobalLoad=true` (small-page) and `kUseGlobalLoad=false` (large-page rebase) paths. Includes both single-batch and multi-batch (`batch_size=4`) cases to exercise per-sequence SRD rebase across the >2 GB pool. - Numerical reference: PyTorch SDPA, per-batch loop with `atol` / `rtol` from the existing batch prefill test harness. ## Test Result \| Arch \| `test_batch_prefill` \| `test_batch_prefill_large_kvcache` (>2 GB) \| \|------\|----------------------\|---------------------\| \| MI308 (gfx942) \| All passed \| Passed \| \| MI355 (gfx950) \| All passed \| Passed \| Performance impact (gfx950, hot SRD path): - +2.67% kernel-time on `seqlen=1024 / page_sz=1024 / bf16 / sglang / causal / soft_cap=30`, attributable in full to the two `set_bottom_tensor_view_buffer_size` calls in the K/V rebase lambdas (5-run median, signal/noise ≈ 9×). - This cost is mandatory for gfx950 correctness on >2 GB workloads — removing the setters re-introduces page-faults. - gfx942: 0 regressions in the same range (all configs ≤ +0.97%). ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-24 07:08:41 +08:00
Illia Silin	d16061f578	[rocm-libraries] ROCm/rocm-libraries#6550 (commit c396de9) [CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550) ## Motivation New changes from upstream llvm-project cause an avalanche of warnings in CK. Gonna disable them by ignoring the lifetime-safety-intra-tu-suggestions flag until a better permanent solution is found. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-22 15:47:47 +00:00
Hosang Yoon	720cc88a31	[rocm-libraries] ROCm/rocm-libraries#6253 (commit 61934c6) [CK_TILE] Enable canonical-NaN BF16 conversion for FMHA on RDNA (#6253) ## Motivation - On gfx11/gfx12, the existing float -> bf16 conversion path in FMHA forward adds noticeable overhead and causes a meaningful performance gap versus fp16. The asm-based path (mode 3) does not improve this on RDNA and can perform even worse. - In particular, on gfx12, bf16 FMHA forward can be up to ~20% slower than the corresponding fp16 path. - This PR reduces that gap by switching FMHA forward to a different BF16 conversion strategy based on Triton’s canonical-NaN round-to-nearest-even behavior. ## Technical Details - Add a new `standard_cnan` BF16 conversion mode to CK Tile. - Implement a canonical-NaN RTN `float -> bf16` conversion path based on the Triton implementation. - Enable this conversion mode by default for FMHA forward builds targeting gfx11/gfx12. - Retune gfx11/gfx12 FMHA forward kernel selection thresholds for some `hdim=128` cases to keep kernel selection aligned with the updated conversion behavior. ## Test Plan ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={hdim} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result - all tests passed when running `test_ck_tile_fmha` - BF16 FMHA forward performance improves by up to ~5% on gfx11. - BF16 FMHA forward performance improves by up to ~10% on gfx12. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-20 14:52:24 -04:00
Bartłomiej Kocot	6f9537fa0b	[rocm-libraries] ROCm/rocm-libraries#6168 (commit 2968835) [CK][CK Tile] Clamp element space size to max int32 value (#6168) ## Motivation Fix oob check by clamping element space size to avoid overflow when tensor is larger than 2GB. ## Technical Details - It is possible that tensor could be larger than 2GB but offsets no, so element space size must be clamped to 2GB if value is larger. ## Test Plan CI ## Test Result Pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. https://github.com/ROCm/composable_kernel/issues/3722 Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2026-04-20 15:32:24 +00:00
Yung-sheng Tu	5d36cad34a	[rocm-libraries] ROCm/rocm-libraries#5857 (commit d77cd41) [CK TILE] Unification of Scale MFMA/WMMA Policy Structs (#5857) ## Motivation The existing unification work supports DENSE and SPARSE intrinsics. In this PR, we enable support for SCALE intrinsics and add example SCALE implementations. ## Technical Details Adding MFMA SCALE intrinsics support, adding tests for MFMA SCALE intrinsics, and adding WMMA SCALE policy trait. Note: fp6 SCALE intrinsics support is not included in this PR, as its handling in ck_tile is currently more specialized and does not follow the same pattern as other datatypes. ## Test Plan Added new tests for the relevant SCALE specialisations. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-20 14:28:23 +00:00
Po Yen Chen	0d53e3674b	[rocm-libraries] ROCm/rocm-libraries#6342 (commit 31bcb51) [CK] Skip fp16 dropout d256 batch tests for compiler VGPR aliasing bug (#6342) ## Summary - Skip fp16 FMHA forward dropout tests that use the d256 tile in batch mode, gated on compiler version - The AMDGPU compiler miscompiles these kernels due to VGPR aliasing of Philox RNG parameters under high register pressure (383 VGPRs) - bf16 dropout tests are unaffected and cover the same code paths ## Root Cause The compiler aliases `ph_seed` and `ph_head_offset` (Philox RNG state stored in VGPRs) with other live data during the softmax main loop. This causes corrupted `buffer_store_byte` writes for dropout randval on wave lanes 32-63, producing NaN in output and LSE tensors. Conditions: fp16 + d256 tile + dropout + batch mode + `qr` pipeline + gfx90a ## Changes - `include/ck_tile/core/config.hpp`: Add `CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE` macro - `test/ck_tile/fmha/test_fmha_fwd.cpp`: Version-gated `GTEST_SKIP` in `TEST_P(Dropout, ...)` ## Test plan - [x] ROCm 7.1.1 (clang 20): 168/168 fp16 dropout tests PASS (no skip active) - [x] ROCm 7.12 (clang 22): 132 PASS, 36 SKIPPED, 0 FAILED - [x] bf16 dropout tests: 168/168 PASS (unaffected by this change)	2026-04-14 14:07:20 +00:00
chris-tsiaousis-hpc	1c04029c0e	[rocm-libraries] ROCm/rocm-libraries#5508 (commit 0ad0aca) [CK Tile] Unification work - mma transformations pipeline (#5508) ## Motivation In this PR we showcase how the amdgcn structs could be used in a pipeline that does some extra pre/post processing. For the sparse intrinsics, so far we compressed the A vector "on the fly" right before the execution of the builtin. This might introduce performance issues down the line if, for example, the user decided to chain multiple sparse builtins. We tackle this problem by creating a specific SparseCompressTransform. A MmaPipelineBase is also created to facilitate those kind of higher level compositions of the amdgcn structs and is integrated to the existing WaveWiseMma prototype. There is an effort to facilitate future operations, like swizzle A/B, C transpose or double/quad attr num access through the MmaPipelineOptionFlags, but those are not yet defined and should do so in a future PR. The pipeline base class is basically at the RFC stage. We also create a runtime test for the existing WaveWiseMma, as well as one for the SparseMma pipeline. ## Technical Details The goal should be to have the pipeline easily expandable. May the CRTP of the base class or the interface in general be insufficient or unable to handle all of our needs, then a design modification should be discussed. ## Test Plan New tests are added. ## Test Result Tests should pass. --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-04-14 09:25:01 +02:00
Brock Hargreaves	a4e36c1b89	[rocm-libraries] ROCm/rocm-libraries#6400 (commit c0b3c95) [MIOPEN] [CK] Revert "[CK] Disable test cases affected by compiler codegen bugs on gfx90a" (#6400) Reverts ROCm/rocm-libraries#6343 This is causing failures in miopen, namely Dbsync gfx942 even though it shouldn't be affected so this needs to be investigated. Please add miopen as a label to the new PR for addressing the compiler codegen bug so that this can be addressed simultaneously.	2026-04-13 20:46:07 -06:00
Ville Pietilä	0f2279920b	[rocm-libraries] ROCm/rocm-libraries#6343 (commit 3604475) [CK] Disable compilation of problematic bwd weight conv instances for gfx90a (#6343) ## Motivation Due to compiler version update, there are test failures in the test suite `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There are four failing tests for FP16/BF16 that arise from a single kernel instance. As the problem is in the current `develop` branch, the test failures are blocking any PR merges into `develop`. An example of a failed CI runs is here: [http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/). The underlying compiler problem is potentially the same as described in #6342 as tests are passing for clang compiler version 20.0 and failing for clang compiler version 22.0. ## Technical Details This PR disables the compilation of the problematic bwd weight conv instance for `gfx90a` by adding a new CMake flag `CK_USE_GFX90A` that allows us to detect when we are compiling for `gfx90a`. Using the new CMake flag, compilation of instance `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8, 4, 1, 8, 8, 8, 8, 1, 1, 2>` is disabled for `gfx90a`. Co-authored-by: Ville Pietilä <>	2026-04-13 13:40:27 +02:00
Kiefer van Teutem	b7156a8bbf	[rocm-libraries] ROCm/rocm-libraries#5515 (commit 40f66c3) [CK Tile] Add Tile Distribution Encoding Calculator (#5515) ## Motivation We want to be able to calculate TileDistributionEncodings describing register mappings for any MmaOp. This is necessary for further integration with CK Tile. This MR adds a new struct TileDistrEncCalc, which takes an amdgcn_mma type (MmaOp) and provides ABC warp distribution encodings for mapping matrix fragment coordinates to register coordinates (lane, vector item) and vice versa. It is able to take CTranpose, Swizzle, and NumAccessA / NumAccessB template parameters for tweaking the tile distributions. Swizzle modification will be implemented later. The current implementation can deal with all intrinsic types and block-hiding. This MR also adds some additional static asserts and derived params within amdgcn_mma_base, to enforce consistency and help calculate Tile Distributions for block-hiding intrinsics. An Example was added that uses the Tile Distr Enc Calc to calc and print register layouts for Tile Distributions for some of our amdgcn_mma structs. It also makes sure that the CTranspose modifier works as intended. Some additional gfx9 intrinsics were added to test block-hiding layouts for the different types of C-block-hiding layouts. The sparse intrinsic wrappers were updated according to Chris's recent changes in another branch (https://github.com/ROCm/rocm-libraries/pull/5508), which moved the compression step outside of the intrinsic itself. This is necessary to make sure that the Calculator can deal with this new interpretation of the sparse intrinsics. I directly copied the new amdgcn structs from Chris's branch and changed nothing else to avoid more complex merges in the future. Note that this means I did not update a bunch of related sparse code since that would be a lot, and therefore I disabled test_amdgcn_sparse_mma for now. The amdgcn_mma_layout test was refactored a bit: - The old register mapping utility was removed and its use was replaced by the new TileDistrEncCalc - More tests were added to test layouts for different types of block-hiding and sparse intrinsics - The Selector method was removed and the tests were split up over target architectures, with each target arch having a direct list of amdgcn structs to be tested. This ensures that we force specific tests on specific architectures and makes sure that the selector doesn't quietly do some workarounds like creating compound intrinsics. ## Test Results Layout tests based on calculated tile distribution encodings pass on all architectures. Calculator works for all currently added amdgcn structs, which includes different types of block-hiding and sparse intrinsics. Printed layouts from new example verified by eye. CTranspose modifier tested for large set of intrinsics.	2026-04-13 08:00:31 +00:00
Aviral Goel	c7eb33078c	[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8) CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302) Depends on #6300 ## Summary Remove 41 commented-out code blocks across 33 files in Composable Kernel, totaling ~200 lines. Identified using an automated dead code scanning skill (`ck-dead-code`) with a calibrated two-stage pipeline: 1. Pre-filter: Keyword-based scan found 1,338 `//`-commented blocks. Calibrated heuristics (trained on 50-sample expert classification) reduced to 89 high-confidence candidates — 93% noise reduction. 2. Expert triage: LLM expert classified each block in context as CODE_REMOVE, CODE_KEEP, or NOT_CODE. \| Classification \| Count \| \|---------------\|-------\| \| Removed (this PR) \| 41 \| \| Kept (debug helpers, alt configs, reference impls) \| 32 \| \| Not code (false positives) \| 16 \| Removed blocks include: superseded implementations, old test data, abandoned stubs, unreachable code, and buggy dead code.	2026-04-10 11:17:11 -04:00

1 2 3 4 5 ...

303 Commits