composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-11 00:39:02 +00:00

Author	SHA1	Message	Date
chris-tsiaousis-hpc	db05d61136	[rocm-libraries] ROCm/rocm-libraries#6212 (commit ccee58d) =?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?= =?UTF-8?q?More=20accurate=20tests=20for=20MmaPipelines=20(#6212)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation This PR solves several issues: #### More accurate tests for MmaPipelines The current tests for the MmaPipelines (test_amdgcn_sparse_mma, test_amdgcn_wavewise_mma) use explicit input fragment vectors filled with 1s, and only check the output of a single lane. We should have tests that actually use the MmaPipelines with non-trivial input matrices and verify the complete output. Some other aspects of the current MmaPipelines tests that I noticed and deserve some attention: 1. There is sometimes iteration over K outside of the pipeline, which is then included in WaveTileK or FragK, which is not correct. We should remove it, move K iteration inside of the pipeline, or be more clear about this outer-K loop size and how it propagates downwards. 2. There is very tight coupling between the kernel, gtest code, and test_pipeline helper, requiring a lot of information and functions to be passed back and forth. 3. The test_pipeline helper is doing a bunch of register-related logic on the host (related to point 1) 4. Without this register logic the only thing it does is check the device, call the kernel, and check the output, but with a lot of boilerplate. #### Test helper for detecting target arch at HOST runtime There is a really apparent issue we faced while writing tests: Scenario: 1. Compile a test that supports both gfx950 and gfx1201 for gfx950 2. Run the test on a server that only has gfx1201 GPU Actual: Segmentation fault Expected: The test can correctly detect from HOST runtime that the DEVICE target_id was different and skips the test. Notes: The only way of detecting the COMPILER_TARGET_ID in the existing "arch" framework is launching a kernel and calling `get_compiler_target()` (so, from a DEVICE code). This will create a segmentation fault if the current arch differs from the target arch. To cope with this issue, we propose to export the compiler target(s) (note they can be many) through `projects/composablekernel/test/ck_tile/core/arch/CMakeLists.txt` and define a test helper to deal with such cases. #### Add composition support to Transforms We have a small number of Transforms which act on MmaOp input and output data, before and after the MmaOp call respectively. These are currently implemented to work on an MmaTile level, but in theory they are also supposed to work at a WaveTile level, i.e. after composition of multiple MmaTiles to create larger effective MNK dimensions. Currently the composed MmaTiles look like 2D C-style arrays of the individual MmaTile level register vectors (see WaveWiseMmaPipeline). The transforms should be able to take these and perform the proper transforms to the whole WaveTile at once. This might allow for better performing transformations. Note: This PR handles the SparseTransform case and if we don't end up doing scale as a transformation, there isn't really much left to do. If we end up having only the sparse transform as a non-trivial transform, then we could also consider removing the Transform framework.	2026-06-03 14:35:18 +00:00
Ville Pietilä	88f8d24c34	[rocm-libraries] ROCm/rocm-libraries#7936 (commit 3dc91e6) [CK Tile] Fix V6 pipeline applicability and split-image initialization (#7936) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation After adding code generation via CK Tile Dispatcher, some fwd and bwd weight tests for CK Tile convolutions are failing. This PR introduced correct applicability checks and fixes the split-image parameter initialization such that non-applicable instances are not invoked during test execution and split-image instances are correctly initialized. ## Technical Details Investigation revealed two distinct problems 1. For bwd weight, the compute V3 uses prefetch of 3 distinct tiles, which works incorrectly when the number of K-slices addressed by the workgroup is 1. This occurs when a large split-K value is used for a problem that results in a small Gemm-K value. 2. For fwd direction, the current CK Profiler/test infrastructure doesn't initialize the split-image parameters for instance where split-image is enable. Uninitialized split-image values result in non-deterministic behavior where the tests might randomly fail. Fixed problem 1. by adding a check in `IsSupportedArgument` that marks the instance invalid if the `num_loops = ceil(GemmK / (k_batch * KPerBlock)) < 4` for V6 pipeline kernel instances. The check is compile-time eliminated for other kernels. Fixed problem 2. by adding initialization of split-image parameters when split-image is enabled. The default initialization corresponds to full image with no split, i.e., the number of splits is 1 and it has the size of the full image. Added unit tests for the added logic. ## Test Plan Running the following test suites cover the logic added in this PR - test_grouped_convnd_fwd_tile - test_ck_tile_grouped_conv_fwd - test_grouped_convnd_bwd_weight_tile - test_ck_tile_grouped_conv_bwd_weight All test suites above are included in the automated test runs. ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-03 08:40:03 +00:00
Anton Gorenko	7ecbf82708	[rocm-libraries] ROCm/rocm-libraries#7500 (commit f5cd4fd) [CK_TILE][FMHA] Optimize long-context decoding on gfx11/12 (#7500) ## Motivation Relevant issue: ROCM-22065 FMHA has less-than-optimal performance of long-context decoding (i.e. when seqlen_q = 1) on gfx11/12. This PR optimizes the splitkv pipeline and configs for such scenarios. ## Technical Details Optimizations applied in this PR: 1. use tiles with smaller M0 (16 vs 64), these tiles are used when seqlen_q <= 16 2. adapt qr_nwarp_sshuffle pipeline for gfx11, it allows to use more warps even for M0 = 16 (the qr pipeline parallelizes work between warps in M dim so with M0 = 16 it allows to use only 1 warp) 3. enable kMergeNumHeadGroupsSeqLenQ (an optimization that merges one group of heads in GQA) for all hdim values, not only 128 4. increase the number of splits (multiply by the number of head groups) if (3) is used 5. increase the number of splits for RDNAs (`multiProcessorCount` is the number of WGPs on RDNAs, not CUs, so it should be doubled to have meaning similar to CDNAs) Performance on gfx1151: \| Case \| develop (GB/s) \| This PR (GB/s) \| \|:-------\|-------:\|-------:\| \| [fp16\\|group\\|bshd] b:1, h:32/32, s:1/45056, d:64/64 \| 127.58 \| 183.11 \| \| [fp16\\|group\\|bhsd] b:1, h:32/32, s:1/45056, d:64/64 \| 153.64 \| 215.02 \| \| [fp16\\|group\\|bshd] b:1, h:16/8, s:1/77184, d:128/128 \| 120.51 \| 225.76 \| \| [fp16\\|group\\|bhsd] b:1, h:16/8, s:1/77184, d:128/128 \| 130.62 \| 223.84 \| \| [fp16\\|group\\|bshd] b:1, h:32/32, s:1/9600, d:128/128 \| 82.65 \| 138.44 \| \| [fp16\\|group\\|bhsd] b:1, h:32/32, s:1/9600, d:128/128 \| 105.75 \| 220.45 \| \| [fp16\\|group\\|bshd] b:1, h:8/1, s:1/401024, d:256/256 \| 16.27 \| 187.89 \| \| [fp16\\|group\\|bhsd] b:1, h:8/1, s:1/401024, d:256/256 \| 16.28 \| 188.19 \| ## Test Plan An additional test case is added to the exiting test. It uses seqlen_q = 1, GQA, no mask to trigger the changes ``` ninja test_ck_tile_fmha_fwd_fp16 && bin/test_ck_tile_fmha_fwd_fp16 --gtest_filter="SplitKV ninja test_ck_tile_fmha_fwd_bf16 && bin/test_ck_tile_fmha_fwd_bf16 --gtest_filter="SplitKV ``` Manual testing can be done with these commands: ``` bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=64 -s=1 -s_k=$((352 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=16 -h_k=8 -d=128 -s=1 -s_k=$((603 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=128 -s=1 -s_k=$((75 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=8 -h_k=1 -d=256 -s=1 -s_k=$((3133 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 ``` ## Test Result All the tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-03 06:16:10 +00:00
Illia Silin	5720589311	[rocm-libraries] ROCm/rocm-libraries#7960 (commit ddac5cf) [CK] Upgrade to new gfx1250 compiler and fix build issues (#7960) ## Motivation The docker image we've been using to build for gfx1250 is a few months old, so we need to upgrade. Some of the changes in the latest compiler version require changes in the code. TDM is temporarily disabled due to changes in the lds load/store intrinsics. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-03 01:58:59 +00:00
Maksim (Max) Podkorytov	d574cc4757	[rocm-libraries] ROCm/rocm-libraries#6696 (commit 9627b91) Replace nested static_for lambdas with compile-time search helper (#6696) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary - Add `sequence_find_value` and `find_in_tuple_of_sequences` compile-time search helpers with O(1) template depth - Replace nested `static_for` lambdas in `TensorDescriptor::GetTransformAndItsUpperDimension` and `InitializeElementSize` - Apply same optimizations to `TensorAdaptor` Supersedes #4287. Conflict-resolved rebase of ROCm/composable_kernel#3600 onto current develop. ## Motivation The `TensorDescriptor` and `TensorAdaptor` classes had excessive template instantiation from: 1. Nested `static_for` loops with lambdas creating unique closure types at every call site 2. `generate_tuple` with lambdas causing per-type instantiation overhead The new helpers use constexpr array lookup and pack expansion instead of recursive template patterns, achieving O(1) template depth. ## Results (`example_grouped_conv_fwd_xdl_fp16`, n=10, interleaved, `-j1`, `-ftime-trace`) \| TU \| Baseline (mean) \| New (mean) \| Delta \| Wilcoxon p \| Mann-Whitney p \| \|----\|-----------------\|------------\|-------\|-----------\|---------------\| \| `grouped_conv_fwd_xdl_fp16` (host) \| 14,886 ms \| 13,353 ms \| -10.3% \| 0.002 \| 0.0002 \| \| `grouped_conv_fwd_xdl_fp16` (device) \| 27,762 ms \| 25,629 ms \| -7.7% \| 0.002 \| 0.0002 \| \| Total (all TUs) \| 57,732 ms \| 54,030 ms \| -6.4% \| \| \| Unrelated TUs (`device_memory`, `host_tensor`, `convolution_parameter`) show no significant difference (p > 0.3), serving as negative controls. ### Methodology - 10 interleaved runs (baseline₁, new₁, baseline₂, new₂, ...) on the same node to eliminate ordering/warmup bias - Wilcoxon signed-rank test (paired, non-parametric) and Mann-Whitney U test (unpaired) - Built with patched clang (LLVM 22) on ctr2-alola-compile-11, `-j1` for accurate per-TU timing - Raw data available in Slurm job 275230 results ## Test plan - [x] 11 unit tests added (5 for `sequence_find_value`, 6 for `find_in_tuple_of_sequences`) - [x] Compile-time benchmark with statistical significance (p < 0.01) - [ ] Full CI Tracking issue: #4229	2026-06-02 23:15:10 +00:00
Aviral Goel	99ab4c4ef7	[rocm-libraries] ROCm/rocm-libraries#7830 (commit 590fe58) [CK_Tile][MI450] Add bf16 output wmma instruction (16x16x32) (#7830) Wire __builtin_amdgcn_wmma_bf16_16x16x32_bf16 into CK Tile for gfx1250, enabling bf16-input bf16-output WMMA at the warp GEMM level. - Add WmmaTraits specialization for <gfx125_t, bf16, bf16, bf16, 16,16,32> - Add WarpGemmAttributeWmmaImpl typedef and WarpGemmWmma alias - Add Dispatcher entry for bf16->bf16 16x16x32 - Add warp_gemm test with reference GEMM validation ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-02 13:54:16 +00:00
Tianyuan Wu	22a99f97e8	[rocm-libraries] ROCm/rocm-libraries#7677 (commit 308af93) [CK_Tile] Add scale16 Support for F4 WMMA in CK_Tile ## Motivation This PR adds CK Tile support for the scale16 F4 WMMA path on gfx1250 and improves warp GEMM unit test coverage/structure for gfx1250-specific cases. ## Technical Details - Scale16 support in warp GEMM dispatch and WMMA trait plumbing: added IsScale16 plumbing to warp GEMM dispatcher path - Warp GEMM test restructuring for gfx1250: added Warp GEMM gfx1250 coverage to verify all F4 WMMA paths ## Test Plan Run ./test_ck_tile_wg_32x16x128_fp4. ## Test Result ``` ./test_ck_tile_wg_32x16x128_fp4 [----------] Global test environment tear-down [==========] 3 tests from 1 test suite ran. (1751 ms total) [ PASSED ] 3 tests. ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-30 01:28:48 +00:00
Emily Martins	95c916369c	[rocm-libraries] ROCm/rocm-libraries#7584 (commit 060bad5) [CK_TILE] Fix Stream-K k_size calculation ## Motivation In a recent benchmarking task for CK Tile Stream-K algorithm, we identified that certain instances segfault. This change works to fix the bug and adds necessary regression tests. ## Technical Details The StreamK kernel constructs tensor views using a `k_size` parameter that determines how much of the K dimension to process in each iteration. Previously, this was calculated as: ```cpp index_t k_size = num_loop_sk * TilePartitioner::KPerBlock; ``` This calculation assumes all macro tiles along K are exactly `KPerBlock` in size. However, when `K % KPerBlock != 0`, the final macro tile along K has a remainder size of `K % KPerBlock`, not a full `KPerBlock` (see the figure below): <img width="961" height="488" alt="image" src="https://github.com/user-attachments/assets/3e1cceed-5dcd-4980-8b02-cee24eecf262" /> With the old code, a workgroup working with the `MPerBlock x (K % KPerBlock)` tile in A and B risk accessing illegal memory. Hence, this change ensures that when `K % KPerBlock != 0`, workgroups processing iterations that include the final macro-tile along K calculate the correct `k_size` based on the remainder rather than assuming a full `KPerBlock`. ## Test Plan I added the following tests: 1. Unit tests added for the Stream-K Tile Partitioner: - `StreamKTilePartitionerBaseGetKSize/NoRemainderTiles` - validates full tiles - `StreamKTilePartitionerBaseGetKSize/RemainderTiles` - validates remainder handling 2. Regression tests that test a case where `K % KPerBlock != 0` ## Test Result Tests passed locally on gfx90a, gfx942, and gfx950. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-29 21:36:49 +00:00
Aviral Goel	15c904b460	[rocm-libraries] ROCm/rocm-libraries#7724 (commit 4cb149a) ck_tile: add FillUniformScaleDistribution and fix MX GEMM scale init (#7724) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary ### Problem MX GEMM pipeline tests were passing vacuously: scale bytes were drawn from a fixed range (40–60) which, for e8m0, maps to scales ≈ 10⁻²⁷ — far below FP16 min denorm. Both GPU and CPU produced all-zero outputs, so numerical checks passed without exercising the GEMM. ### Changes `include/ck_tile/host/fill.hpp` — new `FillUniformScaleDistribution<ScaleType>` functor - Accepts human-readable float bounds and maps them to the raw byte range of any ExMy scale type (e8m0, e4m3, e5m3) by re-centering the IEEE 754 exponent into the type's bias space - Sampling is uniform over raw bytes → uniform over representable values - Fixes left-shift UB: uses multiplication instead of `<< mant_bits` to avoid shifting negative signed integers (C++17 UB) - Adds `assert(min_r <= max_r)` to catch inverted-range UB when both bounds exceed the type's representable range - Provides default member values (0.125f, 2.0f) and `std::optional` seed consistent with sibling fillers - `/** /` Doxygen style with `@note` on snapping asymmetry `test/ck_tile/gemm_mx/test_mx_gemm_pipeline_util.hpp`* — fix scale initialization - Replace manual byte-range distribution with `FillUniformScaleDistribution<>{0.125f, 2.0f}` - Use distinct seeds for scale_a (11941) and scale_b (11943) to avoid correlated scale tensors that were causing 60 test failures for fp4+e5m3/e4m3 combinations `test/ck_tile/utility/test_fill.cpp` — new unit tests for `FillUniformScaleDistribution` - 16 typed tests across e8m0, e4m3, e5m3: validity, range, reproducibility, coverage, snapping, stress, nullopt seed, and range overload - Test helper `expected_raw_range` mirrors implementation clamping exactly	2026-05-29 18:45:13 +00:00
Andriy Roshchenko	d5c9215064	[rocm-libraries] ROCm/rocm-libraries#7359 (commit dd62f9f) [CK_TILE][GFX1250] Enable MX GEMM FLATMM with ASYNC ## Motivation Enables MX GEMM FLATMM pipeline on gfx1250. The pipeline uses an async load instruction for tensor A, which complements the existing MX GEMM FLATMM pipeline with TDM load. At this time, only FLATMM MX pipelines are enabled on gfx1250. ## Technical Details The existing gfx950 implementation was extended to support gfx1250 architecture. All three MX FP data types are supported across the two ASICs. It should be noted that while the TDM pipeline uses an emulated 32x32x128 warp-tile instruction, the present submission relies on the built-in 16x16x128 instruction, called 4 times per warp. ## Test Plan Existing `test/ck_tile/flatmm` tests were extended to cover new gfx1250 functionality. To help facilitate the testing in development, `example/ck_tile/18_flatmm/script/smoke_test_mx.sh` script was introduced to verify various combinations of supported data types and pipeline versions. ## Test Result The present submission is expected to work on both gfx950 and gfx1250 hardware for all reasonable sizes and all MX FP8/FP6/FP4 data types. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. - [x] Relies on #6978 and should only be merged after the changes are merged to the `develop`.	2026-05-29 17:02:45 +00:00
Bartłomiej Kocot	5d912538d3	[rocm-libraries] ROCm/rocm-libraries#7847 (commit b995ef2) [CK] Remove IsPackedTensor function ## Motivation Fix codegen hipRTC ## Technical Details Remove not needed function. Since MakeArgument supports long_index_t strides. ## Test Plan Codegen tests. ## Test Result Passed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-29 14:00:06 +00:00
Illia Silin	c24e528481	[rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) [CK] suppress compiler warnings while building pytorch. (#7760) ## Motivation Recently added compiler flags that are required to suppress false warnings by latest staging compiler are not recognized by older compiler versions and are triggering an avalanche of warnings. Previous attempt to suppress them by using -Wno-unknown-warning-option flag didn't help, because that flag wasn't recognized either and just added more warnings. I've verified that current approach by checking the clang version actually works as intended and makes the warnings go away. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 06:56:58 -07:00
Bartłomiej Kocot	60df79085d	[rocm-libraries] ROCm/rocm-libraries#7631 (commit d591a7c) [CK] Grouped Convolution Global Load/Store support (#7631) ## Motivation Grouped Convolution Global Load/Store support to cover large tensor cases. ## Technical Details Utilize global load for grouped convolution forwad kernels. Update Indexes to use int64. ## Test Plan - test utils - test conv kernels in next pr with instances ## Test Result CI pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1255 --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-05-27 08:21:54 +00:00
JH-Leon-KIM-AMD	00e1d82ae7	[rocm-libraries] ROCm/rocm-libraries#7732 (commit b0e29d9) [CK] Fix grouped conv bwd data stride>1 silent miscompute (ALMIOPEN-1959) (#7732) ## Motivation Fix silent miscompute in the grouped convolution backward-data kernel (`DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1`) when stride > dilation (ALMIOPEN-1959). PR #6208 introduced a flat-descriptor fast path that dropped all but the first sub-GEMM, producing zeroed slices of `dx` on the (G=1, stride>1, 2D, NumDTensor=0) intersection. Restore correctness without giving up the perf gains PR #6208 delivered on stride=1 shapes. ## Technical Details - Tighten the flat-descriptor fast-path gate to require `arg.gemms_count_ == 1` (i.e. a single sub-GEMM per dispatch — its original purpose). For stride > 1, the implicit GEMM is split into `gemms_count_` sub-GEMMs whose output cells tile `dx` disjointly; routing them through the flat path required dropping all but the first, which was the source of the bug. - Stride > 1 now falls through to the existing grouped CShuffle path, which packs all sub-GEMMs into one descriptor array and walks them on-device in a single kernel launch. This is the pre-PR-6208 production path; correctness is established and per-dispatch launch count is minimised. - Add regression coverage for the (G=1, stride>1, 2D, NumDTensor=0) intersection in `test/grouped_convnd_bwd_data/test_grouped_convnd_bwd_data.cpp` with `gemms_count` ∈ {4, 9, 36}. Pre-existing cases did not hit this intersection (all stride>1 cases used G=2; all G=1 cases used stride=1), which is why PR #6208's regression slipped past CI. ## Test Plan - `ctest -L SMOKE_TEST -R 'grouped_convnd_bwd_data'` on gfx942 (smoke tier — runs on every PR via `smart_build_and_test.sh`). - End-to-end verify (`verify=1`) via `example_grouped_conv_bwd_data_xdl_fp16` on stride 1/2/3/6 shapes including the original ALMIOPEN-1959 case and a cross-bucket (`gemms_count=36`) case spanning two `MaxGroupedGemmGroupsNum=32` buckets. - ckProfiler A/B sweep on MI300X (gfx942) toggling the flat-path gate via an environment variable: full kernel-family enumeration, winning kernel + its avg_time reported under each gate. 33/41 shapes completed before the sweep was stopped; the remaining 8 were the largest i2v/synthetic shapes where ckProfiler exceeded its 300s per-shape enumeration budget (not relevant to the verdict). ## Test Result ### Correctness \| Test \| Result \| \|---\|:---:\| \| `test_grouped_convnd_bwd_data` (12 type parameterizations × Test2D, includes 3 new regression shapes) \| 12/12 PASSED in 14.18 s \| \| `test_grouped_convnd_bwd_data_interface` (API checks) \| PASSED in 0.28 s \| \| ALMIOPEN-1959 stride=2 (`verify=1`) \| PASSED \| \| stride=1 K3 (`verify=1`) \| PASSED \| \| stride=3 K3 `gemms_count=9` (`verify=1`) \| PASSED \| \| stride=6 K6 `gemms_count=36` cross-bucket (`verify=1`) \| PASSED \| ### Performance (ckProfiler A/B on gfx942 / MI300X) Comparing the post-fix gate (flat path only when `gemms_count_==1`, column "B") vs the inner-loop variant that keeps the flat path on stride>1 (column "A") across 25 stride>1 shapes where production picks a `_v1` instance (so the gate actually fires): \| Stride \| Shapes \| A wins \| Tie \| B wins \| Notes \| \|:------:\|:------:\|:------:\|:---:\|:------:\|---\| \| 1 (sanity, gate moot) \| 3 \| 0 \| 3 \| 0 \| gate doesn't differentiate — A == B as expected \| \| > 1 (gate fires) \| 25 \| 0 \| 11 \| 14 \| B wins +6% to +32%; A never wins \| Highlights from the firing-gate cases: \| Shape (G=1, stride=2 unless noted) \| A ms \| B ms \| B vs A \| \|---\|---:\|---:\|---:\| \| ALMIOPEN-1959 (N=16, K=256, C=128, 5×5, 40×175) \| 0.183 \| 0.171 \| B +6% \| \| Retinanet-L61 (N=32, K=C=256, 3×3, 25×25) \| 0.054 \| 0.045 \| B +17% \| \| i2v-010 (N=1, K=C=384, 3×3, 277×209) \| 0.174 \| 0.125 \| B +28% \| \| Synthetic 50×50 K3 N=32 K=C=256 \| 0.131 \| 0.088 \| B +32% \| Why B wins everywhere the gate fires: for `gemms_count = N`, the flat path needs N kernel launches (one per sub-GEMM), while the grouped path loops over the same N sub-GEMMs on-device in 1 launch. The (N−1) × launch-tax is a structural disadvantage A can't recover from. ### Diff \| File \| Lines \| \|---\|---:\| \| `include/.../device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v1.hpp` \| +14 / −8 (one extra condition + expanded dispatch comment) \| \| `test/.../test_grouped_convnd_bwd_data.cpp` \| +9 / −0 (3 new shapes) \| ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 09:59:14 +03:00
assistant-librarian[bot]	6181eb2adf	[rocm-libraries] ROCm/rocm-libraries#4279 (commit 5b3f4b7) [CK_TILE] Stream-K XCD remapping (#4279) ## Proposed changes This PR adds support for XCD remapping as detailed in this [document](https://amdcloud.sharepoint.com/:w:/r/sites/ComposableKernels/Shared%20Documents/Stream-K/Design%20Docs/XCD%20Mapping.docx?d=w2df1b0737dc54614970d99a2e26022d1&csf=1&web=1&e=mLVN4A). On gfx942, workgroups are typically scheduled round-robin across XCDs, which can lead to poor locality. We will use a remapping to assign workgroups to contiguous tiles in the XCDs improving the locality and the cache hit rate. This is done through a function that computes this contiguous mapping from this [PR](https://github.com/ROCm/composable_kernel/pull/3161), which we have added to the StreamKTilePartitioner. This will require minimal changes to the Stream-K algorithm, only requiring a remap at the time the workgroups are partitioned. Through this approach we can improve the data locality by improving cache hits therefore closing performance gaps that are seen with the default scheduling. There have been unit tests added to verify the function in isolation. This is an optimization that is not specialized to just Stream-K GEMM and can be applied across GEMM. Note: This only applies to the gfx942 as they introduce the XCDs. Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [x] Any dependent changes have been merged --- 🔁 Imported from [ROCm/composable_kernel#3652](https://github.com/ROCm/composable_kernel/pull/3652) 🧑‍💻 Originally authored by @arai713 --------- Co-authored-by: Astha <astha.rai713@gmail.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Co-authored-by: arai713 <67439843+arai713@users.noreply.github.com>	2026-05-26 09:43:03 -07:00
Yung-sheng Tu	760f9e1d0a	[rocm-libraries] ROCm/rocm-libraries#7104 (commit 0fab8d8) [CK TILE] Unification Work – Add MFMA specialisations for `fp64_t` (#7104) ## Motivation This PR adds two specialisations related to `fp64_t`. ## Technical Details This adds two new specialisations for MFMA dense builtins, and adjusts ABLayout and CLayout to L{K1BM} and L{M1BN}. ## Test Plan All the new wrappers were added to the test suite in test_amdgcn_mma_layout.inc. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-26 10:49:36 +00:00
Michal Kulikowski	8de4cb72fb	[rocm-libraries] direct push (commit 49b73ad) [CK][CK_TILE] POC for Instruction Cache prefetch. Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>	2026-05-25 11:26:26 +02:00
JP-Fernando	74bc86240b	[rocm-libraries] ROCm/rocm-libraries#5647 (commit 490437a) [CK Tile] Add gemm universal preshuffle to MX GEMM (#5647) ## Motivation Add gemm universal preshuffle support to existing MX GEMM pipeline. The straightforward way to do this is to port the `mx_flatmm` pipeline to the existing `gemm_mx` framework. ## Technical Details The `mx_flatmm` pipeline was not deleted, to allow for back-compatibility. ## Test Plan Add `preshuffle` option to example: `tile_example_mx_gemm`. Add new configurations with enabled preshuffle to the existing `test/ck_tile/gemm_mx` tests. ## Test Result Example and tests were successful on `gf950` architecture in the `Alola` cluster. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>	2026-05-22 16:07:53 +02:00
Bartłomiej Kocot	ebb97044f4	[rocm-libraries] ROCm/rocm-libraries#7664 (commit de5d6b1) Revert "[CK] Enable grouped conv bwd data to match non-grouped perf" (#7664) ## Motivation Incorrect results has been introduced for some conv bwd cases. ## Technical Details This reverts commit 33424f65346d6330d0fd94b5a4e6f843f24e52c3. ## Test Plan CI ## Test Result Pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. ALMIOPEN-1959	2026-05-22 12:28:49 +00:00
Wojciech Laskowski	3ea9ce7e37	[rocm-libraries] ROCm/rocm-libraries#6567 (commit 753c7a8) [CK Tile] Adding WMMA wrappers for sparse builtins (#6567) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the third of the series of PRs (after https://github.com/ROCm/rocm-libraries/pull/5801 and https://github.com/ROCm/rocm-libraries/pull/6014) that add all the necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse WMMA intrinsics. ## Technical Details This change adds new specializations for WMMA sparse builtins. In total, we add 8 WMMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Test pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-22 13:34:33 +02:00
Illia Silin	e02c566795	[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24) [CK] upgrade CI to rocm7.13 as default compiler (#7612) ## Motivation Upgrade the default docker and compiler version in CI to rocm7.13. In order to pass all the checks I had to also clean up a lot of non-ascii characters in the source code comments and modify a couple of tests that were affected by a new compiler logic. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2026-05-22 02:43:50 +00:00
JiaLuo-CAN	5ff7497fa7	[rocm-libraries] ROCm/rocm-libraries#7537 (commit 07123f4) [CK Tile] Fix Grouped Gemm quant mixed precision (#7537) <Migrate from Internal repo PR> test_ck_tile_grouped_gemm_quant_tensor would fail for mixed FP8/BF8 cases: std::tuple<Row, Col, Row, FP8, F32, BF8, F32, F32, F16, TensorQuant, False, True, False>, std::tuple<Row, Col, Row, BF8, F32, FP8, F32, F32, F16, TensorQuant, False, True, False> GFX1250 would fail with incorrect results, GFX950 would fail when compiling BF8+FP8 and give incorrect results for FP8+BF8. The issue is due to the wrong ComputeDataType selection. The fix is to consider original ADataType and BDataType even when ComputeDataType is not void. For compiling error on gfx950, the bf8, fp8, 16x16x32 warp Gemm is added.	2026-05-21 08:36:23 -07:00
Wojciech Laskowski	275629fe34	[rocm-libraries] ROCm/rocm-libraries#6014 (commit 2f8259d) [CK Tile] Adding MFMA wrappers for dense builtins (#6014) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the second of the series of PRs (after #5801) that add all the necessary MMA builtins as `amdgcn_mma` structs. This PR focuses on dense MFMA intrinsics. ## Technical Details This change adds new specializations for WMMA dense builtins. In total, we add 55 MFMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Test pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-21 09:05:19 +02:00
JH-Leon-KIM-AMD	720ceb6500	[rocm-libraries] ROCm/rocm-libraries#7528 (commit b4cae6f) [CK Tile] Support multi-vector reads in static encoding patterns (#7528) ## Motivation The thread-raked / warp-raked / block-raked static tile distribution patterns in `ck_tile` silently produce wrong results when the contiguous tile dimension is larger than `warp_size * vector_size`, because the encoding has no per-thread iteration dimension along X. Concretely, with `M_Tile=N_Tile=128`, `VectorSize{A,B,C}=1` in `ConvConfigComputeV3`, the grouped convolution backward-weight example reports about 50 percent wrong values, with errors starting exactly at the `X0X1 = 64` boundary. The second pass over the contiguous dim is never performed. This PR extends the encoding so multi-vector reads in the contiguous tile dimension are supported, while keeping every existing call site bit-for-bit identical. ## Technical Details Three files changed. ### 1. `include/ck_tile/core/algorithm/static_encoding_pattern.hpp` Add a per-thread X iteration dimension in all three raked specializations: - `X0 = min(warp_size, XPerTile / X1)` — threads in X dim - `X1 = min(LargestVec, VecSize)` — vector size per access - `X2 = XPerTile / (X0 X1)` — number of X-iters per thread (new) `X2` is gated with `if constexpr (X2 == 1) { old } else { new }` in both `make_2d_static_tile_distribution()` and `make_shuffled_2d_static_tile_distribution()`. The new encoding places `X2` in the middle of the Ys iteration list, which preserves reverse symmetry between the regular `<..., X2, X1>` and shuffled `<X1, X2, ...>` encodings. Patterns updated: `thread_raked`, `warp_raked`, `block_raked`. ### 2. `include/ck_tile/core/tensor/transpose_tile.hpp` Added a parallel `else if constexpr (... && NDimY == 3 && ...)` branch alongside the existing `NDimY == 2` branch. The original branch is byte-for-byte unchanged. Both branches dispatch to the same `transpose_tile2d_impl_in_thread`, whose body has always been NDimY-generic (iterates with `static_for<0, NDimY, 1>` and `number<NDimY>{}`). ### 3. `experimental/grouped_convolution_tile_instances/generate_instances.py` Removed the two now-obsolete skip guards in `parse_bwd_weight_instances` and `parse_bwd_data_instances`: ```python if m_per_block > (warp_size * a_scalar_per_vector) or n_per_block > (warp_size * b_scalar_per_vector): print(f"Skipping instance {instance_id} with multiple warps per continous tile dim since it's not supported yet.") continue ``` Other unrelated skips (V5 / V6 / ASYNC_V4 pipeline gating, irregular-load shapes, scalar-per-vector > tile size) are kept untouched. ### Compatibility Strict. Every existing caller has `X2 == 1` and therefore hits the original encoding path verbatim. No upstream config or pipeline behavior changes. ## Test Plan The grouped convolution example is the natural exerciser since `GroupedConvUniversalPipelineAgBgCrPolicy` selects `thread_raked` for both A and B tiles, and all three conv directions share the same `ConvConfigComputeV3`. For each test below we ran: ``` ./build/bin/tile_example_grouped_conv_bwd_weight [-prec={fp16,bf16}] ./build/bin/tile_example_grouped_conv_fwd [-prec={fp16,bf16}] ./build/bin/tile_example_grouped_conv_bwd_data [-prec={fp16,bf16}] ``` with `ConvConfigComputeV3` tile/vector parameters tweaked to cover both code paths: \| Test \| M / N / K \| VecA/B/C \| A path \| B path \| dtype \| \|------\|-------------\|----------\|------------\|----------------\|-------------\| \| T1 \| 16/64/32 \| 4/8/4 \| old (X2=1) \| old (X2=1) \| fp16 \| \| T2 \| 128/128/64 \| 2/2/2 \| old (X2=1) \| old (X2=1) \| fp16 \| \| T3 \| 256/256/64 \| 1/1/1 \| old (X2=1) \| new (X2=4) \| fp16 \| \| T5 \| 256/256/64 \| 1/1/1 \| old (X2=1) \| new (X2=4) \| fp16 (3 dir)\| \| T4b \| 128/128/128 \| 1/1/1 \| new (X2=2) \| new (X2=2) \| fp16 + bf16 (3 dir) \| A larger T4a (256/256/128) was attempted to stress both A and B with X2>1 on bigger tiles but was blocked by the gfx942 hardware LDS cap (128 KB > 64 KB limit), independent of this PR. For the generator change we ran: ``` python3 generate_instances.py --mode profiler --direction all ``` and verified `Skipping instance ... with multiple warps per continous tile dim` no longer appears (count went from non-zero to 0); other skip categories are unchanged. `clang-format-18` was applied to both modified `.hpp` files (matches the repo's `.clang-format`). ## Test Result - T1 and T2 (compat-strict, every X2 is 1, old code path): `correct`. Confirms existing callers are unaffected. - T3 (X2=4 on B only): `correct`. First true exercise of the new NDimY=3 encoding + transpose branch. - T5 (T3 across `fwd` + `bwd_data` + `bwd_weight`, fp16): all 3 `correct`. - T4b (X2>1 on both A and B, fp16 + bf16, all 3 directions): all 6 runs `correct`. - Generator: 0 `multiple warps per continous tile dim` skips remaining; other skips unchanged. Sample run output (T4b, bf16, bwd_data): ``` shape: tile_gemm_shape_128x128x128x4_1x4x1_16x16x32 pipeline: pipeline_AgBgCrCompV3_128x128x128_256_1x1x1_1x4_1x1x1_..._DoubleSmemBuffer_0 Vector size A: 1, Vector size B: 1, Vector size C: 1 0.934907 ms, 8.34683 TFlops, 34.3178 GB/s Relative error threshold: 0.00390625 Absolute error threshold: 0.25 The CPU verification result is: correct ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-20 17:25:22 +03:00
Kiefer van Teutem	b5f8bef97f	[rocm-libraries] ROCm/rocm-libraries#6088 (commit 6ac353c) [CK Tile][MFMA/WMMA unification] Add support for packed datatypes (tiny types) (#6088) ## Motivation This MR makes all the changes required for the unified architecture to be able to deal with packed datatypes i.e. int4, fp4, fp6, and bf6. The crux is that layout parameters should be interpreted as describing the pure mathematical matrix fragments, while the ext_vectors and tile distribution encodings describe everything in terms of packed datatype units. This matches how packed types are dealt with in ck_tile and should play nicely with the load and store tile ops once we integrate the unified framework into CK tile. The bf6 datatype was added to CK tile in the form of pk_bf6x16_t and pk_bf6x32_t, which did not exist before. The ext_vector implementations of pk_fp6x16_t and pk_bf6x16_t (vec size 1 and 2) were extended to make the subscripting operator work as expected. The layout test was adapted to be compatible with all packed datatypes, and all new intrinsics were added to the test. This MR adds ALL intrinsics across ALL architectures which use packed datatypes, as well as ALL scale intrinsics: mfma_scale_f32_16x16x128_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6, BF6xBF6) mfma_scale_f32_32x32x64_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6, BF6xBF6) wmma_i32_16x16x16_iu4_w32 wmma_i32_16x16x16_iu4_w32_gfx12 wmma_i32_16x16x32_iu4_w32_gfx12 ## Testing All intrinsics were tested on all architectures.	2026-05-20 12:36:13 +00:00
Enrico Degregori	9565ca21ec	[rocm-libraries] ROCm/rocm-libraries#5552 (commit 369c7a2) [CK Tile] Eight Waves pipeline for MX GEMM (#5552) ## Motivation Integrate Eight Waves pipeline in MX GEMM ## Technical Details - EightWaves pipeline: - Add pipeline, policy and block gemm (internally using existing implementation used by GEMM and ABQuant) - Extend support of EightWaves policy for FP4 (packed types) - Async pipeline: - Fix pipeline with packed scales (requires MRepeat and NRepeat to be contiguous) - block gemm specific for MX GEMM is defined because distribution encodings have changed - CShuffle: - Add new functionality to support MRepeat and NRepeat contiguous (defined by `TilesPacked`) - Examples: - Refactor examples to easily switch different configurations (similar to GEMM universal) - Scales values generated consistently with other microscale implementations in CK Tile - Add configuration for EightWaves pipeline - Tests: - Unify existing FP8 and FP4 tests - Add tests for EightWaves pipeline - Scales values generated consistently with other microscale implementations in CK Tile Note: FP6 support for MX GEMM was added later and the support for the Eight Waves pipeline will be done in following PR ## Test Plan Add new pipeline to tests: `test_ck_tile_mx_gemm_async` for both FP4 and FP8 ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-19 11:53:19 -07:00
JH-Leon-KIM-AMD	9a5d1ea791	[rocm-libraries] ROCm/rocm-libraries#6208 (commit 33424f6) [CK] Enable grouped conv bwd data to match non-grouped perf via NoShuffle + packed descriptors (#6208) ## Motivation Improve performance of grouped convolution backward-data kernels to match non-grouped kernel performance for G=1 cases. ## Technical Details - Add NoShuffle epilogue path (direct VGPR→Global writes) by setting `CDEBlockTransferScalarPerVector_NPerBlock = 1` - Add nongrouped-match instances with optimized BBlockTransfer parameters for better thread utilization - Add packed (flat) descriptor path for G=1 2D convolutions, using simpler tensor descriptors with fewer transform layers to reduce address computation overhead in the GEMM main loop - Cherry-pick PR #6090 for fair benchmarking (cache flush, include dX zeroing cost) ## Test Plan - Benchmark grouped vs non-grouped kernels on MI300X (589 shapes, BF16) - Verify correctness with existing conv bwd data tests ## Test Result \| Metric \| Before \| After \| \|--------\|--------\|-------\| \| Mean ratio (grouped/nongrouped) \| 1.159 \| 1.028 \| \| Median ratio \| 1.142 \| 1.026 \| \| Cases within 2% \| 26 (4.4%) \| 186 (31.8%) \| \| Cases >20% slower \| 188 (32%) \| 2 (0.3%) \| NoShuffle + nongrouped-match instances achieve ~2.8% average gap with non-grouped kernels (down from ~16%). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: root <root@ctr-cx64-mi300x-4.amd.com> Co-authored-by: root <root@ctr-cx71-mi300x-01.amd.com> Co-authored-by: root <root@ctr-cx63-mi300x-21.amd.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: root <root@gt-ccs-aus-h17-18.cs-aus.dcgpu> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 06:49:50 -07:00
Yung-sheng Tu	3ccb72e761	[rocm-libraries] ROCm/rocm-libraries#6207 (commit cc56378) [CK TILE] Unification Work – Add `print()` Utility to `MmaOpTraits` (#6207) ## Motivation It would be useful to have a `print()` utility inside of unification work's code scope, so that we can print all template params and derived params of `amdgcn_mma` for easier debugging. ## Technical Details Adding helper functions and struct to traits, adding `print_flags()` for each `DefaultCtrlFlags`, `amdgcn_target` and `MmaOpTraits` structs, and adding `print()` for `amdgcn_mma`. Note: the first commit is not* in the scope of this PR. This PR should be merged after https://github.com/ROCm/rocm-libraries/pull/5801 and https://github.com/ROCm/rocm-libraries/pull/5857. ## Test Plan Adding test in layout test. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-18 13:02:38 +02:00
Illia Silin	df07f060c1	[rocm-libraries] ROCm/rocm-libraries#7471 (commit 13b9eec) [CK] increase timeout limit for fmha_fwd tests to avoid CI failure on gfx11 (#7471) ## Motivation This should prevent fmha_fwd tests from timing out on one of the slower gfx11 CI nodes and generating false CI failures. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-15 16:41:23 +00:00
Bartłomiej Kocot	067e5e0ca4	[rocm-libraries] ROCm/rocm-libraries#6838 (commit ff7a665) [CK_TILE] Add depthwise conv2d forward kernel (FP16/FP32) (#6838) ## Motivation CK currently has no kernel optimized for depthwise convolution (G=C_in=C_out, C=K=1 per group) and existing generic paths perform poorly for this workload. This PR adds a dedicated depthwise conv forward kernel in CK Tile. ## Technical Details Adds a dedicated depthwise conv2d forward op to CK Tile that performs direct convolution rather than falling back to the generic GEMM path. The kernel is templatized by filter size, stride, and data type, and compiled into ~60 instances covering common configurations (kernel 3/5/7/9, stride 1/2, FP16/FP32). Supports both CDNA (gfx942/gfx950) and RDNA (gfx1100/gfx1200) architectures. ## Test Plan - [x] Correctness and performance validated on gfx942, gfx950, and gfx1100, with ckProfiler `grouped_conv_fwd` as baseline. - [ ] MI300A (gfx942) and gfx1200 validation. ## Submission Checklist - [x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1137 --------- Co-authored-by: GenDu <Gen.Du@amd.com>	2026-05-15 15:47:55 +02:00
Illia Silin	717f2efef7	[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d) [CK] add composable kernel support on gfx1250 (#6978) ## Motivation Add composable kernel support on gfx1250. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Qun Lin <qlin@amd.com> Co-authored-by: jialuo12_amdeng <jia.luo@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>	2026-05-15 06:46:51 -07:00
Illia Silin	ac18460782	[rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70) [CK] Suppress new staging compiler errors (#7384) ## Motivation This should make new builds with staging compiler pass. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 12:51:08 -07:00
Linjun-AMD	5003f7ef8a	[rocm-libraries] ROCm/rocm-libraries#7272 (commit d02f3c0) [ck_tile][fmha_bwd] Fix sink_host OOB in group mode reference runner (#7272) ## Summary In `fmha_bwd_runner.hpp`, the `sink_host` `HostTensor` is allocated with first dimension `shape_batch` (= 1 in group mode), but the reference forward loop accesses `sink_host(wb, i_h)` with `wb ∈ [0, batch-1]`. For any `wb >= 1` this is an out-of-bounds heap read, silently corrupting the reference forward math chain (`lse_host`, `o_host`) and turning the bwd-side `d_sink_head_acc` reference into non-deterministic garbage. `HostTensor::operator()` does not bounds check, so the OOB is not caught at runtime. This manifests as intermittent `tile_example_fmha_bwd` failures (25–67% fail rate) when `-sink_grad=1` is combined with `-mode=1` (group mode), with bit-exact but spurious `max_err` values like 4.27 / 14.6. ## Fix One-line: allocate `sink_host` with `batch` (the real per-batch dim) instead of `shape_batch`, mirroring how `sink_host` is accessed by the loop. ```diff - sink_grad ? std::array<ck_tile::index_t, 2>{shape_batch, nhead} + sink_grad ? std::array<ck_tile::index_t, 2>{batch, nhead} Repro tile_example_fmha_bwd -b=2 -h=2 -s=516 -s_k=253 -prec=bf16 -d=72 \ -bias=n -dbias=0 -p_drop=0 -iperm=1 -operm=1 -deterministic=0 \ -v=3 -mode=1 -kname=1 -sink_grad=1 Verification - 0/30 fail on the repro config after fix - Baselines (before fix): - sink=1, mask=n: 25% fail rate (p ≈ 1.8e-4) - sink=1, mask=t: 67% fail rate (p ≈ 6e-15) Attribution Shape bug introduced together with sink_grad in #5504. Unrelated to #6914 (which is a fwd-only fix on a different code path) ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: junlin12 <junlin12@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2026-05-13 16:47:50 +08:00
Illia Silin	22b9feb40f	[rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f) [CK] Fix latest batch of staging compiler warnings (#7111) ## Motivation Suppress the new batch of clang lifetimebound and invalidation warnings with the latest staging compiler. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-08 07:14:14 -07:00
Wojciech Laskowski	640bd560ec	[rocm-libraries] ROCm/rocm-libraries#5801 (commit 27f6d15) [CK Tile] Adding WMMA wrappers for dense builtins (#5801) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the first of the series of PRs that add all the necessary MMA builtins as a `amdgcn_mma` structs. ## Technical Details This change adds new specializations for WMMA dense builtins. In total, we have now 9 RDNA4 builtins and 3 RDNA3 builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Test pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Yung-sheng Tu <yung-sheng@streamhpc.com>	2026-04-27 11:57:51 +00:00
Sami Remes	de3fa71992	[rocm-libraries] ROCm/rocm-libraries#6611 (commit 5375c0f) [CK_TILE] Preserve input strides in EightWaves async-load descriptor (#6611) `MakeAsyncLoadADramWindow` in `GemmPipelineAgBgCrCompAsyncEightWavesPolicy` was rebuilding the 6D view descriptor with `make_naive_tensor_descriptor_packed`, which synthesizes strides from lengths and assumes a dense layout. When the input view's leading-dim stride is larger than its inner length (non-packed memory layout), the resulting tile window stepped through memory at the wrong stride. Compose the unmerge transforms on top of the input view's existing descriptor instead, so the actual runtime strides are preserved and the correct `element_space_size` is inherited for bounds checking. ## Test Plan Added an unit test showing the problem. ## Test Result The new test fails before fixes and passes after. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-22 12:52:02 +02:00
Bartłomiej Kocot	9348b8eb82	[rocm-libraries] ROCm/rocm-libraries#5646 (commit 05680a4) [CK_TILE] Add conv bwd data tests (#5646) ## Motivation This PR adds tests for CK Tile's convolution backward data operation to enable functionality regression tracking and error-detection. ## Technical Details Currently only NHWGC/GKCYX/NHWGK and NDHWGC/GKCZYX/NDHWGK(2 dim and 3 dim channel-last) layouts are being tested, since only they are implemented in CK Tile. Current tests support FP16, BF16 and FP32 datatypes and various different convolutions scenarios. The tested instances are listed in `experimental/grouped_convolution_tile_instances` directory. ## Test Result All implemented tests are working properly and passing. --------- Co-authored-by: Ville Pietilä <> Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com> Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>	2026-04-21 21:49:19 +00:00
Yung-sheng Tu	5d36cad34a	[rocm-libraries] ROCm/rocm-libraries#5857 (commit d77cd41) [CK TILE] Unification of Scale MFMA/WMMA Policy Structs (#5857) ## Motivation The existing unification work supports DENSE and SPARSE intrinsics. In this PR, we enable support for SCALE intrinsics and add example SCALE implementations. ## Technical Details Adding MFMA SCALE intrinsics support, adding tests for MFMA SCALE intrinsics, and adding WMMA SCALE policy trait. Note: fp6 SCALE intrinsics support is not included in this PR, as its handling in ck_tile is currently more specialized and does not follow the same pattern as other datatypes. ## Test Plan Added new tests for the relevant SCALE specialisations. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-20 14:28:23 +00:00
Max Podkorytov	3aee45e115	[rocm-libraries] ROCm/rocm-libraries#5383 (commit b660b8c) [CK_TILE] Add CShuffleLds microbenchmark suite (#5383) ## Summary Microbenchmarks isolating LDS store/load operations in CShuffleEpilogue for bank conflict analysis. ## Motivation CShuffleEpilogue performs LDS store (MFMA registers → LDS) and load (LDS → registers for coalesced global writes). This suite isolates each operation to: - Identify which operation causes bank conflicts - Measure pure LDS bandwidth per access pattern - Validate access patterns across MFMA tile sizes and wave layouts ## Components - Microkernels (`tile_load_store_microkernels.hpp`): `StoreTile<Setup>`, `LoadTile<Setup>` - Setup Adapters (`benchmark_cshuffle_lds.hpp`): Wire CShuffleEpilogue to microkernels - Template (`benchmark_template.cpp.in`): Generated benchmarks with timing ## Build ```bash cmake -G Ninja -B build -S . \ -DGPU_TARGETS=gfx950 \ -DBUILD_CK_EXAMPLES=ON \ -DBUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS=ON ninja -C build bench_lds_fp8_16x16x128_2x2_fp8 ``` ## New CMake Options \| Option \| Default \| Description \| \|--------\|---------\|-------------\| \| `BUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS` \| OFF \| LDS microbenchmarks \| \| `BUILD_CK_TILE_FMHA_TESTS` \| ON \| FMHA tests \| \| `BUILD_CK_TILE_ENGINE` \| ON \| Tile engine \| \| `BUILD_CK_TILE_ENGINE_TESTS` \| ON \| Tile engine tests \| \| `BUILD_CK_EXAMPLES` \| ON \| Examples \| \| `BUILD_CK_TUTORIALS` \| ON \| Tutorials \| \| `BUILD_CK_DEVICE_INSTANCES` \| ON \| Device instances \| \| `BUILD_CK_PROFILER` \| ON \| Profiler \| Setting guards to OFF reduces cmake configure from ~150s to ~5s. --------- Made-with: Claude Code, Opus 4.5	2026-04-14 20:43:23 -07:00
msaffari-amd	cf517ec050	[rocm-libraries] ROCm/rocm-libraries#5863 (commit 31d9247) [CK_TILE] Separate PermuteN epilogue from CShuffle epilogue into standalone file (#5863) ## Motivation The PermuteN epilogue was previously embedded within cshuffle_epilogue.hpp, despite having fundamentally different behaviour. Coupling these two independent strategies in one file introduced unnecessary complexity, SFINAE guards, and a dual operator() overload selected at compile time via TiledMMAPermuteN_ template parameter. This PR separates PermuteN into its own standalone file(pertmuten_epilogue.hpp), simplifying both implementations and making the codebase easier to maintain and extend independently. ## Technical Details New file: permuten_epilogue.hpp: contains PermuteNEpilogueProblem and PermuteNEpilogue, extracted from the permuteN code path in cshuffle_epilogue.hpp. Cleanup of cshuffle_epilogue.hpp: - Removed the TiledMMAPermuteN_ template parameter from [CShuffleEpilogueProblem] - Removed the SFINAE-guarded permuteN operator() overload - Removed the EnablePermuateN_ SFINAE alias - CShuffle now only contains CShuffle logic; EightWave support (independent feature) is retained Consumer migration : All consumer files now use compile-time epilogue selection via [std::conditional_t] `using GemmEpilogue = std::conditional_t< TiledMMAPermuteN, PermuteNEpilogue<PermuteNEpilogueProblem<...>>, CShuffleEpilogue<CShuffleEpilogueProblem<...>>>;` Files modified: - flatmm_basic.cpp, moe_flatmm.cpp, a16w4_moe_flatmm.cpp, mixed_prec_flatmm.cpp, mx_flatmm_instance.hpp — flatmm examples - run_gemm_quant_example.inc — block-scale GEMM example - gemm_weight_preshuffle_invoker.hpp — weight preshuffle invoker - test_gemm_quant_fixtures.hpp, test_gemm_persistent_async_input.cpp, test_gemm_pipeline_util.hpp — test utilities - universal_gemm_invoker.hpp — universal GEMM invoker - epilogue.hpp — add header updated to include permuten_epilogue.hpp ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2026-04-14 20:22:18 +00:00
arai713	12f3d646a0	[rocm-libraries] ROCm/rocm-libraries#4769 (commit 72ae66e) [CK_TILE] Restructure Tile Engine's benchmarking and profiling (#4769) ## Motivation This PR introduces a restructure for the benchmarking and profiling aspects of CK Tile's Tile Engine, expanding on the groundwork from this previous https://github.com/ROCm/composable_kernel/pull/3434 and outlined in this [design document](https://amdcloud-my.sharepoint.com/:w:/r/personal/astharai_amd_com/Documents/Restructuring%20Tile%20Engine.docx?d=w14ea28a30718416988ed5ebb759bd3b2&csf=1&web=1&e=l3VBuX). In PR 3434, to reduce repeated code we implemented: - Base class that centralizes common functionality and provides a default implementation (Universal GEMM) - Child classes for GEMM variants override virtual functions to handle variant-specific behavior This refactoring in this PR follows the same process and should greatly reduce the duplicated code present in Tile Engine and make it simpler to add in new operations, increasing scalability. ## Technical Details The files have been refactored around new base structs for benchmarks, profiling and problem descriptions. The new base structs are: - GemmProblem - GemmBenchmark - GemmProfiler Universal GEMM, Preshuffle GEMM, and Multi-D GEMM all have child classes that will inherit from these base structs overriding only what differs per variant. All common functions across the benchmarking and profiling files have been moved into newly added common utility files under the commons/ directory. The new utility files are: - utils.hpp: common functions for the benchmarking and profiling process - benchmark_utils.py: common utility functions for the benchmark generation ## Test Plan I tested using the existing tests for Tile Engine. ## Test Result All tests passed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-14 10:50:24 -07:00
Po Yen Chen	0d53e3674b	[rocm-libraries] ROCm/rocm-libraries#6342 (commit 31bcb51) [CK] Skip fp16 dropout d256 batch tests for compiler VGPR aliasing bug (#6342) ## Summary - Skip fp16 FMHA forward dropout tests that use the d256 tile in batch mode, gated on compiler version - The AMDGPU compiler miscompiles these kernels due to VGPR aliasing of Philox RNG parameters under high register pressure (383 VGPRs) - bf16 dropout tests are unaffected and cover the same code paths ## Root Cause The compiler aliases `ph_seed` and `ph_head_offset` (Philox RNG state stored in VGPRs) with other live data during the softmax main loop. This causes corrupted `buffer_store_byte` writes for dropout randval on wave lanes 32-63, producing NaN in output and LSE tensors. Conditions: fp16 + d256 tile + dropout + batch mode + `qr` pipeline + gfx90a ## Changes - `include/ck_tile/core/config.hpp`: Add `CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE` macro - `test/ck_tile/fmha/test_fmha_fwd.cpp`: Version-gated `GTEST_SKIP` in `TEST_P(Dropout, ...)` ## Test plan - [x] ROCm 7.1.1 (clang 20): 168/168 fp16 dropout tests PASS (no skip active) - [x] ROCm 7.12 (clang 22): 132 PASS, 36 SKIPPED, 0 FAILED - [x] bf16 dropout tests: 168/168 PASS (unaffected by this change)	2026-04-14 14:07:20 +00:00
chris-tsiaousis-hpc	1c04029c0e	[rocm-libraries] ROCm/rocm-libraries#5508 (commit 0ad0aca) [CK Tile] Unification work - mma transformations pipeline (#5508) ## Motivation In this PR we showcase how the amdgcn structs could be used in a pipeline that does some extra pre/post processing. For the sparse intrinsics, so far we compressed the A vector "on the fly" right before the execution of the builtin. This might introduce performance issues down the line if, for example, the user decided to chain multiple sparse builtins. We tackle this problem by creating a specific SparseCompressTransform. A MmaPipelineBase is also created to facilitate those kind of higher level compositions of the amdgcn structs and is integrated to the existing WaveWiseMma prototype. There is an effort to facilitate future operations, like swizzle A/B, C transpose or double/quad attr num access through the MmaPipelineOptionFlags, but those are not yet defined and should do so in a future PR. The pipeline base class is basically at the RFC stage. We also create a runtime test for the existing WaveWiseMma, as well as one for the SparseMma pipeline. ## Technical Details The goal should be to have the pipeline easily expandable. May the CRTP of the base class or the interface in general be insufficient or unable to handle all of our needs, then a design modification should be discussed. ## Test Plan New tests are added. ## Test Result Tests should pass. --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-04-14 09:25:01 +02:00
Brock Hargreaves	a4e36c1b89	[rocm-libraries] ROCm/rocm-libraries#6400 (commit c0b3c95) [MIOPEN] [CK] Revert "[CK] Disable test cases affected by compiler codegen bugs on gfx90a" (#6400) Reverts ROCm/rocm-libraries#6343 This is causing failures in miopen, namely Dbsync gfx942 even though it shouldn't be affected so this needs to be investigated. Please add miopen as a label to the new PR for addressing the compiler codegen bug so that this can be addressed simultaneously.	2026-04-13 20:46:07 -06:00
Ville Pietilä	0f2279920b	[rocm-libraries] ROCm/rocm-libraries#6343 (commit 3604475) [CK] Disable compilation of problematic bwd weight conv instances for gfx90a (#6343) ## Motivation Due to compiler version update, there are test failures in the test suite `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There are four failing tests for FP16/BF16 that arise from a single kernel instance. As the problem is in the current `develop` branch, the test failures are blocking any PR merges into `develop`. An example of a failed CI runs is here: [http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/). The underlying compiler problem is potentially the same as described in #6342 as tests are passing for clang compiler version 20.0 and failing for clang compiler version 22.0. ## Technical Details This PR disables the compilation of the problematic bwd weight conv instance for `gfx90a` by adding a new CMake flag `CK_USE_GFX90A` that allows us to detect when we are compiling for `gfx90a`. Using the new CMake flag, compilation of instance `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8, 4, 1, 8, 8, 8, 8, 1, 1, 2>` is disabled for `gfx90a`. Co-authored-by: Ville Pietilä <>	2026-04-13 13:40:27 +02:00
Kiefer van Teutem	b7156a8bbf	[rocm-libraries] ROCm/rocm-libraries#5515 (commit 40f66c3) [CK Tile] Add Tile Distribution Encoding Calculator (#5515) ## Motivation We want to be able to calculate TileDistributionEncodings describing register mappings for any MmaOp. This is necessary for further integration with CK Tile. This MR adds a new struct TileDistrEncCalc, which takes an amdgcn_mma type (MmaOp) and provides ABC warp distribution encodings for mapping matrix fragment coordinates to register coordinates (lane, vector item) and vice versa. It is able to take CTranpose, Swizzle, and NumAccessA / NumAccessB template parameters for tweaking the tile distributions. Swizzle modification will be implemented later. The current implementation can deal with all intrinsic types and block-hiding. This MR also adds some additional static asserts and derived params within amdgcn_mma_base, to enforce consistency and help calculate Tile Distributions for block-hiding intrinsics. An Example was added that uses the Tile Distr Enc Calc to calc and print register layouts for Tile Distributions for some of our amdgcn_mma structs. It also makes sure that the CTranspose modifier works as intended. Some additional gfx9 intrinsics were added to test block-hiding layouts for the different types of C-block-hiding layouts. The sparse intrinsic wrappers were updated according to Chris's recent changes in another branch (https://github.com/ROCm/rocm-libraries/pull/5508), which moved the compression step outside of the intrinsic itself. This is necessary to make sure that the Calculator can deal with this new interpretation of the sparse intrinsics. I directly copied the new amdgcn structs from Chris's branch and changed nothing else to avoid more complex merges in the future. Note that this means I did not update a bunch of related sparse code since that would be a lot, and therefore I disabled test_amdgcn_sparse_mma for now. The amdgcn_mma_layout test was refactored a bit: - The old register mapping utility was removed and its use was replaced by the new TileDistrEncCalc - More tests were added to test layouts for different types of block-hiding and sparse intrinsics - The Selector method was removed and the tests were split up over target architectures, with each target arch having a direct list of amdgcn structs to be tested. This ensures that we force specific tests on specific architectures and makes sure that the selector doesn't quietly do some workarounds like creating compound intrinsics. ## Test Results Layout tests based on calculated tile distribution encodings pass on all architectures. Calculator works for all currently added amdgcn structs, which includes different types of block-hiding and sparse intrinsics. Printed layouts from new example verified by eye. CTranspose modifier tested for large set of intrinsics.	2026-04-13 08:00:31 +00:00
Aviral Goel	81a5826132	[rocm-libraries] ROCm/rocm-libraries#6323 (commit a668483) CK: Extract shared boilerplate from 47 gemm_quant test files (#6323) Depends on #6303 ## Summary Extract shared test boilerplate (includes, type aliases, test fixture macros) from 47 `test_gemm_quant_` files into a single `test_gemm_quant_common.hpp` header. Each test file is reduced from ~50 lines of boilerplate to ~5 lines. \| Metric \| Value \| \|--------\|-------\| \| Files changed \| 48 \| \| Insertions \| +413 \| \| Deletions \| −1,106 \| \| Net lines removed* \| −693 \| ### What changed \| Before \| After \| \|--------\|-------\| \| 47 test files, each with ~50 lines of identical includes, type aliases, and fixture macros \| 1 shared header (`test_gemm_quant_common.hpp`) + 47 thin files (~5 lines each: include + params) \| ### Readability assessment A code realist review confirmed this change improves readability: the 47 test files had identical boilerplate obscuring the only meaningful content — the `GemmConfig` type alias and test dimensions. After the refactoring, each file's unique configuration is immediately visible, and adding a new test variant requires specifying only the varying parameters instead of copying 50 lines. ### Cumulative cleanup series stats \| PR \| Description \| Net lines \| \|----\|-------------\|-----------\| \| #6300 \| Remove 61 dead `#if 0` blocks \| −2,648 \| \| #6302 \| Remove 41 commented-out dead code blocks \| −2,861 \| \| #6303 \| Remove 4 orphaned files \| −3,886 \| \| This PR \| Extract gemm_quant test boilerplate \| −693 \| \| Total \| \| −10,088 \|	2026-04-11 06:00:26 -04:00
Aviral Goel	c2663ce9fd	[rocm-libraries] ROCm/rocm-libraries#6303 (commit 784c268) CK: Remove 4 orphaned files with verified replacements (~1,025 lines) (#6303) Depends on #6302 ## Summary Remove 4 orphaned files that have verified replacements already in the build. \| File \| Reason \| Replacement \| \|------\|--------\|-------------\| \| `test_gemm_pipeline_compiler.cpp` \| Refactored into 13 smaller tests \| `_compv3`, `_compv4`, `_mem`, `_persistent`, etc. \| \| `test_grouped_gemm_quant.cpp` \| Refactored into 5 smaller tests \| `_rowcol`, `_tensor`, `_aquant`, `_bquant`, etc. \| \| `..._f8_f8_f16_..._comp_default_instance.cpp` \| Superseded by split files \| `_part1.cpp` + `_part2.cpp` \| \| `..._f8_f8_f16_..._comp_kpadding_instance.cpp` \| Superseded by split files \| `_part1.cpp` + `_part2.cpp` \| Each deletion was verified: - Original file is NOT in any CMakeLists.txt - Replacement files ARE in CMakeLists.txt and actively compiled - Content is fully covered by the replacement files	2026-04-10 11:22:31 -04:00
Aviral Goel	c7eb33078c	[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8) CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302) Depends on #6300 ## Summary Remove 41 commented-out code blocks across 33 files in Composable Kernel, totaling ~200 lines. Identified using an automated dead code scanning skill (`ck-dead-code`) with a calibrated two-stage pipeline: 1. Pre-filter: Keyword-based scan found 1,338 `//`-commented blocks. Calibrated heuristics (trained on 50-sample expert classification) reduced to 89 high-confidence candidates — 93% noise reduction. 2. Expert triage: LLM expert classified each block in context as CODE_REMOVE, CODE_KEEP, or NOT_CODE. \| Classification \| Count \| \|---------------\|-------\| \| Removed (this PR) \| 41 \| \| Kept (debug helpers, alt configs, reference impls) \| 32 \| \| Not code (false positives) \| 16 \| Removed blocks include: superseded implementations, old test data, abandoned stubs, unreachable code, and buggy dead code.	2026-04-10 11:17:11 -04:00
Christopher Millette	c8cf71f179	[rocm-libraries] ROCm/rocm-libraries#5938 (commit 73f3650) [CK_TILE] Optimize static_ford and sequence compile-time infrastructure (#5938) ## Problem Each `static_for<0, N, 1>` instantiates its lambda N times (one per `number<I>` type). When nested, intermediate lambdas capture the outer loop variable (a different type per iteration), creating unique closure types. For a 3-level nest with M=4, N=4, K=2, this produces 4 + 16 + 32 = 52 IR functions, of which 20 are intermediate closures that get inlined away but still cost frontend compile time. ck_tile's `static_ford` was supposed to eliminate these intermediates (as old CK's PR #5031 did successfully), but it used a recursive `static_ford_impl` that recreated the same closure pattern plus added `reorder_old_to_new`/`reorder_new_to_old` overhead. Additionally, the sequence utility layer (`sequence_sort`, `is_valid_sequence_map`) used recursive template metaprogramming that generated O(N log N) intermediate types for every permutation validation — called on every `reorder_new_to_old`/`reorder_old_to_new` invocation. ## Changes ### 1. Replace `sequence_sort` with constexpr insertion sort Replace recursive merge sort (`sequence_sort_impl` + `sorted_sequence_merge_impl`, O(N log N) intermediate type instantiations) with constexpr insertion sort using `static_array`. O(1) template depth, same `::type` and `::sorted2unsorted_map` API. ### 2. Replace `is_valid_sequence_map` with constexpr check Replace sort-based permutation validation (which instantiated the full `sequence_sort` chain) with a constexpr "seen array" loop. O(N) constexpr steps instead of O(N log N) template instantiations. ### 3. Replace recursive `static_ford` with flat-loop `index_decomposer` Replace `static_ford_impl` (recursive `static_for` nesting + `pop_front`/`push_back` + `reorder_old_to_new` per iteration) with flat `index_decomposer` using pre-computed strides. Add `decompose_reordered` alias that folds reordering into decomposition, and `inverse_perm` helper that avoids the `sequence_map_inverse` → `is_valid_sequence_map` → `sequence_sort` chain. ### 4. Eliminate internal lambda via `ford_applier` The flat-loop approach still used `static_for` with a lambda, creating M×N internal lambda instantiations per call site. Replace with `ford_applier` struct that calls `f(decompose<I>{})` directly via fold expression — zero intermediate closures: ```cpp // Before: 2×M×N function instantiations static_for<0, MN, 1>{}([&](auto i) { f(decompose<i>{}); }); // After: M×N function instantiations (50% reduction) ford_applier<Decomposer, make_index_sequence<MN>>{}(f); ``` Also unified identity and non-identity order paths into a single template with `constexpr if`. ### 5. Fix const-qualified sequence handling Fix `is_valid_sequence_map` to handle const-qualified sequence types via `remove_cvref_t` in callers (`tensor_adaptor.hpp`, `tile_distribution_encoding.hpp`). ## Results (this PR only, without flattening) ### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942, load ~5) \| Target \| Base (s) \| Treat (s) \| Delta \| % \| Wins \| Significant? \| \|--------\|----------\|-----------\|-------\|---\|------\|-------------\| \| flatmm \| 160.1 \| 152.7 \| -7.4s \| -4.6% \| 6/7 \| YES (W+=1, p<0.05) \| \| universal_gemm \| 228.4 \| 224.7 \| -3.7s \| -1.6% \| 6/7 \| Trending (W+=4) \| Per-trial diffs (flatmm): [-6, -20, -9, -8, -8, 4, -5] Per-trial diffs (universal_gemm): [-2, -6, 4, -3, -2, -11, -6] ### IR Function Counts (device trace, gfx942) \| Target \| Metric \| Before \| After \| Delta \| % \| \|--------\|--------\|--------\|-------\|-------\|---\| \| universal_gemm \| InstantiateFunction \| 117,715 \| 109,165 \| -8,550 \| -7.3% \| \| universal_gemm \| CodeGen Function \| 47,912 \| 45,044 \| -2,868 \| -6.0% \| \| flatmm \| InstantiateFunction \| 100,939 \| 95,127 \| -5,812 \| -5.8% \| \| flatmm \| CodeGen Function \| 42,651 \| 40,367 \| -2,284 \| -5.4% \| Note: The `ford_applier` (commit 3) has minimal additional effect in this PR since ck_tile code does not yet use `static_ford` extensively. Its impact compounds when the follow-up flattening PR #5939 converts 124 `static_for` nests to `static_ford`. Combined results with #5939: flatmm -7.5% wall time (p<0.01), CodeGen -10.5%. ### ASM Equivalence 7/7 PASS — 979,943 lines of device assembly verified identical (gfx942 + gfx1100). TUs: universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale. ## Test plan - [x] `test_ck_tile_static_ford`: 13 behavioral tests (identity/non-identity orders, 1D-4D, unit dimensions, edge cases) - [x] `ck_tile_unit_sequence`: 88 tests (11 new for sorted2unsorted_map, is_valid_sequence_map edge cases, sequence_unique_sort map round-trip) - [x] ASM equivalence verified (980K lines) - [x] Wilcoxon timing verified (7 trials, flatmm p<0.05) - [ ] CI 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-02 15:25:14 -06:00

1 2 3 4 5 ...

710 Commits