Commit Graph

1551 Commits

Author SHA1 Message Date
Ding, Yi
92f2ed758e [fmha-bwd] Implement group-mode persistent scheduling with optimized state management 2026-04-16 22:27:25 -05:00
Ding, Yi
1a9404ac96 [CK_TILE] Use Persistent Scheduling for FMHA BWD Group Deterministic 2026-04-09 03:24:49 -05:00
Ding, Yi
bedd60a568 Merge remote-tracking branch 'origin/develop' into users/yiding12/fmha-bwd-workspace 2026-04-07 23:13:09 -05:00
Ding, Yi
5a63b343d0 Fix 2026-04-07 23:12:39 -05:00
Christopher Millette
870112b861 [CK_TILE] Flatten nested static_for loops into static_ford (#5939)
## Summary
Mechanical conversion of 129 nested `static_for`/`static_ford` patterns
to flat `static_ford` across 29 ck_tile header files.

Each conversion eliminates intermediate lambda closure instantiations by
replacing nested compile-time loops with a single flat iteration using
index decomposition.

### What `static_ford` eliminates

When `static_for` loops are nested, each level creates unique closure
types:
```cpp
// BEFORE: M + M×N = 20 IR functions (for M=4, N=4)
static_for<0, 4, 1>{}([&](auto m) {        // 4 closure instantiations
    static_for<0, 4, 1>{}([&](auto n) {     // 4×4 = 16 closure instantiations
        body(m, n);
    });
});

// AFTER: M×N = 16 IR functions (with ford_applier, no intermediates)
static_ford<sequence<4, 4>>{}([&](auto mn) {
    constexpr auto m = number<mn[number<0>{}]>{};
    constexpr auto n = number<mn[number<1>{}]>{};
    body(m, n);
});
```

### Pattern categories converted

| Category | Count | Description |
|----------|-------|-------------|
| C (2-level `static_for` chains) | 112 | Nested `static_for` →
`static_ford` |
| C3 (3-level `static_for` chains) | 9 | Three consecutive nests →
`static_ford` |
| Partial rescue | 3 | Outer 2 levels of blocked 4-level nests |
| B (nested `static_ford` merge) | 5 | Two nested `static_ford` → single
higher-dim `static_ford` |
| **Total** | **129** | Across 29 files |

6 false positives were detected and reverted (in `tensor_adaptor.hpp`,
`tile_distribution.hpp`, `tile_distribution_encoding.hpp`) where the
inner loop bound depended on the outer variable.

### Files changed by family

| Family | Files | Sites |
|--------|-------|-------|
| Block GEMM | 12 | ~20 |
| FlatMM pipelines | 4 | ~69 (including 5 ford-ford merges) |
| GEMM quant | 7 | ~22 |
| FlatMM kernel | 1 | 2 |
| FMHA | 1 | 2 |
| Reduce/norm | 2 | 2 |
| Epilogue | 1 | 1 |

### Blocked locations from review comments

- **block_gemm_areg_breg_creg_v1.hpp:356** — BLOCKED: runtime scale
loads (`scale_a_slice`, `scale_b_slice`, A warp tensor load) between
every nesting level
- **block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp:228** — BLOCKED:
`zero_accumulators()` before inner loop; `sched_barrier` + conditional
`block_sync_lds()` after inner loop
- **block_universal_gemm_as_aquant_bs_bquant_cr.hpp:298** — BLOCKED:
runtime `CWarpTensor` construction before inner loop; quantization scale
application code after inner loop
- **block_universal_gemm_as_aquant_bs_cr.hpp:277** — BLOCKED: same
pattern as above
- **block_universal_gemm_as_bs_bquant_cr.hpp:367** — BLOCKED: same
pattern as above

## Depends on
- #5938 ([CK_TILE] Optimize static_ford and sequence compile-time
infrastructure) — provides the `ford_applier` that makes these
conversions beneficial. Without it, `static_ford` uses a recursive
implementation that provides no IR function savings.

## Results (combined with #5938)

### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942)

| Target | Base (s) | Treat (s) | Delta | % | Significant? |
|--------|----------|-----------|-------|---|-------------|
| **flatmm** | 161.1 | 149.0 | **-12.1s** | **-7.5%** | **YES** (p<0.01,
7/7 wins) |
| **universal_gemm** | 225.4 | 220.3 | **-5.1s** | **-2.3%** | **YES**
(p<0.01, 7/7 wins) |

### IR Function Counts (device trace, gfx942)

| Target | InstFunc | CodeGen |
|--------|----------|---------|
| universal_gemm | **-8.5%** | **-9.2%** |
| flatmm | **-7.6%** | **-10.5%** |

### ASM Equivalence
5/5 PASS — 650,151 lines verified identical (gfx942). TUs:
universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale.

## Test plan
- [x] ASM equivalence verified (650K lines, gfx942)
- [x] Wilcoxon timing verified (7 trials, p<0.01)
- [x] IR function counts verified (-7.6% to -10.5% CodeGen reduction)
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2026-04-07 08:36:45 -06:00
Po Yen Chen
6dc44114ba [CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051)
## Motivation

The existing FMHA V3 pipeline only supports fp16/bf16 data types. This
PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950,
enabling higher throughput for
  FP8 inference workloads using the assembly-optimized V3 code path.

  ## Technical Details

  **Warp GEMM:**
- Add FP8 32x32x32 warp gemm with C-transposed distribution
(`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries

  **V3 Kernel (`fmha_fwd_v3_kernel.hpp`):**
- Add per-tensor descale support for Q, K, V tensors, passing descale
pointers through to pipeline kargs

  **V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`):**
  - Add FP8 data path with dtype-aware type selection
  - Add asm volatile P matrix conversion from f32 to fp8
  - Add FP8-aware instruction scheduling in `CoreLoopScheduler`

**V3 Pipeline Policy
(`block_fmha_fwd_v3_pipeline_default_policy.hpp`):**
- Add FP8 QK warp gemm selection (SwizzleB variant for V tile
distribution compatibility)

  **Codegen (`fmha_fwd.py`):**
  - Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128)
- Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale:
no/pertensor)
  - Extend `can_dispatch_v3` condition for fp8bf16 + pertensor

  **Misc:**
- Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum
(`arch.hpp`)
- Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`)

V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`)
pending further validation.

## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950)

  | Problem | Regular (TFlops) | Varlen (TFlops) |
  |---|---:|---:|
  | batch=1 heads=6/1 seqlen=1024 causal | 48.9 | 47.6 |
  | batch=1 heads=6/1 seqlen=2048 causal | 119.8 | 117.4 |
  | batch=1 heads=6/1 seqlen=4096 causal | 263.7 | 259.2 |
  | batch=1 heads=6/1 seqlen=8192 causal | 548.9 | 543.6 |
  | batch=1 heads=6/1 seqlen=16384 causal | 1043.0 | 1063.7 |
  | batch=1 heads=6/1 seqlen=32768 causal | 1237.2 | 1279.6 |
  | batch=1 heads=6/1 seqlen=65536 causal | 1315.4 | 1382.7 |
  | batch=1 heads=6/1 seqlen=131072 causal | 1326.3 | 1402.2 |
  | batch=1 heads=16/1 seqlen=65536 causal | 1298.7 | 1388.4 |
  | batch=1 heads=40/40 seqlen=37200 non-causal | 1248.9 | 1326.1 |

## Test Plan

Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering
batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40),
GQA ratios (1:1, 1:8), and
causal/non-causal modes. Verified all cases dispatch to the V3 pipeline
by enabling `F_is_v3_enabled` and confirming kernel names contain
`qr_async_trload_v3`.

  ## Test Result

176/176 tests passed with V3 enabled. All cases correctly dispatched to
V3 pipeline with `pertensor` quantization.

  ## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-07 22:19:28 +08:00
Jeff Huang
449844e3d3 [CK_TILLE] Temporarily remove batch prefill KV cache overflow asserts (#6201)
## Summary
- Temporarily remove the KV cache offset overflow assert checks in
`FmhaBatchPrefillWithPagedKVCacheKernel`
- The asserts are **correct**, but they block project progress in
certain configurations
- This is a **temporary workaround** to unblock progress; a proper fix
will follow

## Note
This is NOT a permanent solution. A follow-up PR will add proper
overflow handling that addresses the underlying issue without blocking
progress.
2026-04-07 20:41:24 +08:00
Ding, Yi
3848d2411a Merge origin/develop into users/yiding12/fmha-bwd-workspace 2026-04-07 05:28:49 -05:00
Bartłomiej Kocot
4112e08d0c [CK][CK Tile] Conv Bwd Data flush cache and profiling improvements (#6090)
## Motivation

Improve accuracy of conv bwd data perf measurements

## Technical Details
- enable flush cache
- for grouped conv we zero conv input(gemm output) inside device op, so
we also include this in time measurement
- for non-grouped conv we zero conv input(gemm output) outside device op
(in profile_conv_bwd_data_impl.hpp) so it is not included.
- In this pr I changed it to include zeroing if time_kernel/flush cache
is enabled so at now you should have more fair comparison. I changed it
only for time_kernel/flush_cache because MIOpen run own zeroing for
non-grouped solvers.

## Test Plan

test_grouped_conv_bwd_data_*

## Test Result

CI pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-04 00:22:22 +00:00
harkgill-amd
12fe2c3de4 Add missing gfx1033 to gfx103 group definition in ck (#5141)
## Motivation

Resolving PyTorch build failures when enabling builds for gfx103X-all
family in TheRock. https://github.com/ROCm/TheRock/pull/3763. `gfx1033`
is the only failing architecture in the family and the failures point to
missing support in CK.

## Technical Details

PyTorch build fails with repeated error message
```
/__w/TheRock/TheRock/external-builds/pytorch/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:33:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
   33 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
      |                                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
`gfx1033` is missing from the `__gfx103__` group which results in
`CK_BUFFER_RESOURCE_3RD_DWORD` never being defined for it. Adding in
`gfx1033` to the missing files which should be the minimum fix to allow
torch builds to pass.

## Test Plan

Compile sample test file and target gfx1033
```
...
#ifdef __HIP_DEVICE_COMPILE__
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == 0x31014000, "wrong device value");
#else
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == -1, "wrong host value");
#endif
```

## Test Result

Prior to the applying patch, compilation fails with `error: use of
undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'`

After applying patch, test file compiles successfully.

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-04-03 13:44:38 -06:00
Ding, Yi
28afc8fee3 [CK_TILE] Use Unified Workspace for FMHA BWD 2026-04-03 03:54:38 -05:00
Thrupti Raj Lakshmana Gowda
8c42eaf3f8 [CK Tile] Fix architecture-dependent EightWave assignment in cshuffle_epilogue (#6102)
## Motivation
Tile engine CI build on the develop branch started failing after a
recent change(https://github.com/ROCm/rocm-libraries/pull/5218) in
`cshuffle_epilogue.hpp`. The `EightWave` constant was unconditionally
computed as `(MWave * NWave == 8)` for all architectures, but this logic
is only valid for gfx9*. On other architectures (e.g., gfx1201),
`EightWave` must always be `false`, otherwise it leads to incorrect
`BlockedXDLN_PerWarp` computation and build failures.

## Technical Details
In `cshuffle_epilogue.hpp`, the `EightWave` static constexpr was set as:
```cpp
static constexpr bool EightWave = (MWave * NWave == 8);
```
This was applied regardless of the target GPU architecture. The fix uses
a preprocessor guard to make this architecture-aware:
- **gfx9* (`__gfx9__`):** `EightWave` is evaluated as `(MWave * NWave ==
8)` — true or false depending on the wave configuration
- **All other architectures:** `EightWave` defaults to `false`

## Test Plan
- Tile engine CI build on develop branch

## Test Result
- *Pending CI*

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---
> **Note:** This PR supersedes ROCm/rocm-libraries#5436, which is
blocked pending a review approval from a reviewer currently on PTO. The
same changes have been applied to this branch
(`users/tlakshma/ck/develop-clone`) to allow merging.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-04-02 20:55:03 -06:00
Christopher Millette
522902f29b [CK_TILE] Optimize static_ford and sequence compile-time infrastructure (#5938)
## Problem

Each `static_for<0, N, 1>` instantiates its lambda N times (one per
`number<I>` type). When nested, intermediate lambdas capture the outer
loop variable (a different type per iteration), creating unique closure
types. For a 3-level nest with M=4, N=4, K=2, this produces 4 + 16 + 32
= 52 IR functions, of which 20 are intermediate closures that get
inlined away but still cost frontend compile time.

ck_tile's `static_ford` was supposed to eliminate these intermediates
(as old CK's PR #5031 did successfully), but it used a **recursive**
`static_ford_impl` that recreated the same closure pattern plus added
`reorder_old_to_new`/`reorder_new_to_old` overhead.

Additionally, the sequence utility layer (`sequence_sort`,
`is_valid_sequence_map`) used recursive template metaprogramming that
generated O(N log N) intermediate types for every permutation validation
— called on every `reorder_new_to_old`/`reorder_old_to_new` invocation.

## Changes

### 1. Replace `sequence_sort` with constexpr insertion sort
Replace recursive merge sort (`sequence_sort_impl` +
`sorted_sequence_merge_impl`, O(N log N) intermediate type
instantiations) with constexpr insertion sort using `static_array`. O(1)
template depth, same `::type` and `::sorted2unsorted_map` API.

### 2. Replace `is_valid_sequence_map` with constexpr check
Replace sort-based permutation validation (which instantiated the full
`sequence_sort` chain) with a constexpr "seen array" loop. O(N)
constexpr steps instead of O(N log N) template instantiations.

### 3. Replace recursive `static_ford` with flat-loop `index_decomposer`
Replace `static_ford_impl` (recursive `static_for` nesting +
`pop_front`/`push_back` + `reorder_old_to_new` per iteration) with flat
`index_decomposer` using pre-computed strides. Add `decompose_reordered`
alias that folds reordering into decomposition, and `inverse_perm`
helper that avoids the `sequence_map_inverse` → `is_valid_sequence_map`
→ `sequence_sort` chain.

### 4. Eliminate internal lambda via `ford_applier`
The flat-loop approach still used `static_for` with a lambda, creating
M×N internal lambda instantiations per call site. Replace with
`ford_applier` struct that calls `f(decompose<I>{})` directly via fold
expression — zero intermediate closures:

```cpp
// Before: 2×M×N function instantiations
static_for<0, M*N, 1>{}([&](auto i) { f(decompose<i>{}); });

// After: M×N function instantiations (50% reduction)
ford_applier<Decomposer, make_index_sequence<M*N>>{}(f);
```

Also unified identity and non-identity order paths into a single
template with `constexpr if`.

### 5. Fix const-qualified sequence handling
Fix `is_valid_sequence_map` to handle const-qualified sequence types via
`remove_cvref_t` in callers (`tensor_adaptor.hpp`,
`tile_distribution_encoding.hpp`).

## Results (this PR only, without flattening)

### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942, load ~5)

| Target | Base (s) | Treat (s) | Delta | % | Wins | Significant? |
|--------|----------|-----------|-------|---|------|-------------|
| **flatmm** | 160.1 | 152.7 | **-7.4s** | **-4.6%** | 6/7 | **YES**
(W+=1, p<0.05) |
| universal_gemm | 228.4 | 224.7 | -3.7s | -1.6% | 6/7 | Trending (W+=4)
|

Per-trial diffs (flatmm): [-6, -20, -9, -8, -8, 4, -5]
Per-trial diffs (universal_gemm): [-2, -6, 4, -3, -2, -11, -6]

### IR Function Counts (device trace, gfx942)

| Target | Metric | Before | After | Delta | % |
|--------|--------|--------|-------|-------|---|
| **universal_gemm** | InstantiateFunction | 117,715 | 109,165 |
**-8,550** | **-7.3%** |
| **universal_gemm** | CodeGen Function | 47,912 | 45,044 | **-2,868** |
**-6.0%** |
| **flatmm** | InstantiateFunction | 100,939 | 95,127 | **-5,812** |
**-5.8%** |
| **flatmm** | CodeGen Function | 42,651 | 40,367 | **-2,284** |
**-5.4%** |

Note: The `ford_applier` (commit 3) has minimal additional effect in
this PR since ck_tile code does not yet use `static_ford` extensively.
Its impact compounds when the follow-up flattening PR #5939 converts 124
`static_for` nests to `static_ford`. Combined results with #5939: flatmm
**-7.5%** wall time (p<0.01), CodeGen **-10.5%**.

### ASM Equivalence
7/7 PASS — 979,943 lines of device assembly verified identical (gfx942 +
gfx1100). TUs: universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale.

## Test plan
- [x] `test_ck_tile_static_ford`: 13 behavioral tests
(identity/non-identity orders, 1D-4D, unit dimensions, edge cases)
- [x] `ck_tile_unit_sequence`: 88 tests (11 new for sorted2unsorted_map,
is_valid_sequence_map edge cases, sequence_unique_sort map round-trip)
- [x] ASM equivalence verified (980K lines)
- [x] Wilcoxon timing verified (7 trials, flatmm p<0.05)
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-02 15:25:14 -06:00
Linjun-AMD
ba0efe01af [CK Tile] Add sink token gradient support in FMHA backward pass (#5504)
## Motivation

Adds sink token support to the FMHA backward kernel (dot_do_o pipeline):

## Technical Details

- Extend BlockFmhaBwdOGradDotOPipelineProblem with LSEDataType
- Add sink_ptr/d_sink_ptr/lse_ptr/nhead to FmhaBwdOGradDotOCommonKargs
- Compute per-head sink gradient via atomic accumulation in the pipeline
- Update example runner with reference validation for sink gradient

## Test Plan

Add new test case

## Test Result

WIP

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-02 11:17:01 +08:00
Chinmay Dattanand Kuchinad
820ed2dbb3 [CK] Fix async pivot mismatch in persistent GEMM kernel scheduler (#5776)
## Motivation

Fix pivot mismatch in the persistent GEMM kernel's async input scheduler
that causes **GPU hangs** and incorrect results when used with AsyncTP
(Asynchronous Tensor Parallelism) on ROCm.

PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM
kernel with chunk signals to overlap communication and computation. The
pivot mechanism ensures each rank starts computing from its own local
shard first (which is already available), then moves to remote chunks as
they arrive over the network.

Because of the pivot mismatch, the kernel frequently waits on signals
for chunks that have not yet arrived, while attempting to read data from
completely different chunks. This synchronization desync reliably
triggers infinite hangs during multi-GPU native AsyncTP execution. This
fix is required to enable functional AsyncTP support on ROCm.

## Technical Details

In the persistent kernel loop (`UniversalGemmKernel::operator()`), the
M-tile coordinate used for data selection (`i_m`) and the M-tile
coordinate used for the chunk-signal wait (`chunk_idx`) were derived
from inconsistent bases:

* `i_m` was computed from the **unpivoted** tile index `iM`.
* `chunk_idx` was computed from the **pivoted** expression `(iM +
tile_idx_pivot)`.

This means the kernel could wait for chunk N's signal but then read from
chunk M's memory, or vice versa. The mismatch scales with GPU count:
with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc.

**The Fix:**
Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m`
and `chunk_idx` from it. This guarantees the kernel always waits for the
correct chunk before reading its data.

*(Note: Minor cosmetic `clang-format` changes were also pulled in
alongside the fix).*

## Test Plan

1. Build PyTorch with this CK change.
2. Run the specific multi-GPU AsyncTP native test:
`timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest
test/distributed/test_symmetric_memory.py -k
test_fused_all_gather_matmul_native -q -s -x`

## Test Result

Tests verify correct overlapping execution without hangs or accuracy
mismatches when running the AsyncTP native path with non-zero pivots.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-04-01 09:21:20 -07:00
aledudek
357a140e7b [CK_TILE] Add pooling in tile_engine (#4469)
## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
Add pooling in ck tile engine

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2026-04-01 07:31:46 +00:00
Yi DING
9b8b2456b4 [CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR misallocation (#5991)
## Motivation

After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard
around the
`num_total_loop <= 0` early-exit check, the IGLP pipeline
(`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV
gradients for
non-masking kernels (even with fix in #5915). Assembly inspection
confirms that the CFG change causes the LLVM
register allocator to reuse AGPR accumulators as scratch destinations in
the dK/dV
reduction loop, breaking the loop-carried accumulation across Q-tile
iterations.

## Technical Details

- Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in
`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing:
it
restores the CFG shape that the register allocator needs to correctly
assign
  dedicated AGPRs to each column of the dK/dV accumulator.
- Only the IGLP pipeline is affected; the other two BWD pipelines do not
exhibit
  this issue.

## Test Plan

## Test Result

## Submission Checklist

- [x] Look over the contributing guidelines at

https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-01 13:44:04 +08:00
Estevan Vedovelli
2510e7b238 [CK] contraction: extend GetTypeString() to include layout-differentiating params (#6022)
## Motivation

Consumers that identify kernels by their `GetTypeString()` (such as
hipTensor's actor-critic kernel selection, which hashes the string into
a
stable cross-platform UID) were silently dropping one of two colliding
variants during registry insertion.

`GetTypeString()` in `DeviceContractionMultipleD_Xdl_CShuffle`
previously
printed 13 template parameters, omitting
`ABlockTransferSrcScalarPerVector`,
`BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and
`BBlockLdsExtraN`.

These four parameters determine the block-transfer access width and LDS
padding strategy, and are precisely what differentiates the `kk`, `kn`,
`mk`, and `mn` layout variants from one another when all other geometry
parameters are equal. Two instantiations with identical 13-parameter
strings
are distinct C++ types that accept different stride layouts and reject
each
other's arguments via `IsSupportedArgument`.

This patch extends the output to 17 parameters so that every distinct
template instantiation of this class produces a unique
`GetTypeString()`.

## Technical Details


`include/ck/tensor_operation/gpu/device/impl/device_contraction_multiple_d_xdl_cshuffle.hpp`:
- extend `GetTypeString()` from 13 to 17 parameters including
`ABlockTransferSrcScalarPerVector`,
`BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and
`BBlockLdsExtraN`.
    
## Test Plan

Build CK and hipTensor with these changes, and verify hipTensor can
differentiate and select the
correct kernels with layout variations.

## Test Result

CK is building correctly and hipTensor is selecting the kernels
correctly.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-31 08:18:11 -07:00
Bartłomiej Kocot
f14ee90152 [CK][CK Tile] Force padding for atomic_add bf16 C tensor (#5842)
## Motivation

Force padding for atomic_add bf16 C tensor to avoid memfaults.

## Technical Details

- add global atomic add for bf16 and enable them
- add padding for atomic add bf16 due to the lack of oob
- remove padding for not continous dims in conv for other cases
- minor bwd data conv fixes

## Test Plan

test_grouped_conv_*_tile

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-31 10:02:24 +02:00
jakpiase
0b98317983 [CK_TILE] Changed cshuffle LDS descriptor to naive layout (#5729)
## Motivation
This PR changes gemm/convolution cshuffle layout into plain one. to
improve cshuffle operation performance.

## Technical Details
The purpose is that before this change the cshuffle layout was having
some descriptor transformations that were probably aimed at reducing LDS
bank conflicts, but the transformations itself were terribly slow, which
negatively impacted the performance.

## Test Plan
There is no need for additional tests, since current tests cover this
functionality.
2026-03-31 03:39:03 +00:00
Illia Silin
3873bf3b91 [CK] fix clang lifetimebound errors with staging compiler (#5921)
## Motivation

The ROCm staging compiler (newer Clang) enforces
`[[clang::lifetimebound]]` annotations on methods that return references
or pointers to internal object data. Without these annotations, the
staging compiler emits compilation errors for container accessor methods
across the CK and CK Tile namespaces.

  ## Technical Details

Adds `[[clang::lifetimebound]]` to all reference/pointer-returning
accessors in core container types:

  **`ck::` namespace:**
  - `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()`
  - `index_array` -- `operator[]`
  - `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()`
  - `IndexLookupTable` -- `operator[]`

  **`ck_tile::` namespace:**
  - `array` -- `get(i)`, `at()`, `operator[]`, `operator()`
  - `static_array` -- `operator[]`
  - `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()`
  - `make_kernel()` -- parameter pack

Also removes the unused `instance_index` variable from
`batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing
  accordingly.

  ## Test Plan

- Compile with the staging compiler to verify all lifetimebound errors
are resolved
- Existing tests pass unchanged -- the attribute is a compile-time
annotation with no runtime effect

 ## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-30 07:19:32 -07:00
Yi DING
4a1abd0e31 [CK_TILE] Fix FMHA BWD register pressure by wrapping num_total_loop with amd_wave_read_first_lane (#5915)
## Motivation

In three FMHA backward pipelines, `num_total_loop` is computed without
`amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even
though it is logically uniform across all lanes. This raises register
pressure, and under high pressure the compiler may reuse VGPRs across
overlapping live ranges. This was confirmed via assembly inspection: the
compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an
intermediate value for dV, producing incorrect dK/dV gradients.

## Technical Details

Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three
pipelines:
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr`
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp`
- `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr`

This promotes `num_total_loop` to an SGPR, eliminating the excess
register pressure and the incorrect VGPR reuse.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-30 09:44:35 +08:00
Jan Patrick Lehr
d1327bedb7 [CK] More lifetime-warning suppression (#5639)
## Motivation

The staging compiler picked up another change from upstream that leads
to more lifetime-analysis warnings. This breaks the build, given CK is
built with -Werror. As a result, compiler promotion is blocked.

## Technical Details
This patch adds the pragma push diagnostics to ignore the
lifetime-warnings in the modified files to unblock compiler promotion.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-28 11:19:46 +00:00
Linjun-AMD
e40a675f74 [CK_TILE ]Revert "[CK_TILE] Enable MXFP6 for MX GEMM op (#5095)" (#5849)
This reverts commit 7e55766ddf7e9e20791b0e4e2d7b4026cf16b637.

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 20:36:39 +00:00
Bartłomiej Kocot
dad85b964c [CK] Fix min k_batch calculation in conv kernels (#5785)
## Motivation

Avoid division by 0 and remove not needed "-1".

## Technical Details

Our div up implementation return lower value if input is divisible.
There is no need to subtract 1.

## Test Plan

test_grouped_conv_bwd_weight

## Test Result

Passed locally.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-1019
2026-03-27 15:37:37 +00:00
Johannes Graner
c60514f371 [CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393)
## Motivation

Add StreamK work distribution to the CK Tile grouped convolution
backward weight kernel. Split-K divides the K-dimension uniformly across
a fixed `k_batch`, which causes load imbalance when the number of output
tiles doesn't evenly fill the GPU. StreamK distributes total
K-iterations evenly across workgroups, improving utilization on these
shapes.

## Technical Details

StreamK is added as an `if constexpr` branch in the existing kernel,
selected by the `TilePartitioner_` template parameter. Two reduction
strategies are supported:
- **Linear**: tile-starter sequentially accumulates partials from
contributing CTAs
- **Tree**: pairwise binary tree reduction (O(log n) depth, faster for
many contributors)

Both persistent and non-persistent data-parallel (DP) sections are
supported.

Key changes:
- `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution
path with `RunStreamK`/`RunStreamKLoop`, partial store/load via
workspace, flag-based cross-CTA synchronization,
`GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions
- `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers)
and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by
both GEMM and Conv StreamK kernels
- `streamk_gemm_kernel.hpp`: Refactored to use shared helpers
- Merged split-K and StreamK example invokers via `PartitionerPolicy`
template parameter
- StreamK example binary with `--streamk_reduction=linear|tree` and
`--streamk_persistent=0|1`
- CK Builder integration: `SpecifiesStreamK` concept,
`TilePartitionerType` factory helper, `InstanceTraits` with StreamK
fields
- 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP),
negative, builder regression

### Performance (MI355X, gfx950)

Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}):

| Shape | 16x64 tiles | | 128x128 tiles | |
|---|---|---|---|---|
| | Split-K | StreamK | Split-K | StreamK |
| 1x1 128x128 N=32 28x28 | 1.00x | 0.54x | 1.00x | 0.81x |
| 3x3 128x128 N=32 14x14 | 1.00x | 0.59x | 1.00x | 0.62x |
| 1x1 256x64 N=32 56x56 | 1.00x | 0.83x | 1.00x | 1.83x |
| 3x3 512x512 N=2 7x7 | 1.00x | 1.12x | 1.00x | 0.62x |
| 1x1 1024x1024 N=4 7x7 | 1.00x | 1.09x | 1.00x | 0.60x |
| 3x3 128x128 N=32 28x28 | 1.00x | 0.44x | 1.00x | 0.96x |
| 3x3 256x256 N=32 14x14 | 1.00x | 0.67x | 1.00x | 0.93x |
| 3x3 512x512 N=32 7x7 | 1.00x | 0.98x | 1.00x | 1.16x |

StreamK's value depends on tile config: with larger tiles (fewer output
tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up
to 1.16x on typical large-channel convolutions. Tree reduction
consistently outperforms Linear when multiple CTAs contribute to the
same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n)
sequential accumulation. The table reports the best of Linear and Tree
for each shape.

## Test Plan

```bash
ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk
./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk

# Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON)
ninja -C build check-builder
```

30 tests covering:
- Host-side: type traits, kernel args construction, grid size, workspace
size
- GPU end-to-end (Linear + Tree): small/medium shapes, multi-group,
stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher
occupancy
- Persistent DP: Linear + Tree with persistent data-parallel dispatch
- Negative: `IsSupportedArgument` rejects unaligned K and C
- Builder: Create (instance string validation) + Execution (reference
comparison) + instance string regression

## Test Result

All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK
tests pass. Full `check-builder` suite passes. Tolerances computed
dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-27 10:17:10 +01:00
Yi DING
8554618d6a [CK_TILE] Fix NaN for FMHA BWD When seq_q=0 (#5790)
## Motivation
This PR addresses NaNs in the FMHA backward (dQ/dK/dV) path when the
effective query sequence length for a tile is zero, by ensuring the
per-tile pipelines exit early with zeroed accumulators and by avoiding
an early kernel return that prevented writing out cleared gradients.

## Technical Details
- Add unconditional early-exit in the dK/dV pipelines when
`num_total_loop <= 0` (no work), returning zeroed accumulators.
- Adjust group-mode kernel early-return logic to only return when
**both** `seqlen_q` and `seqlen_k` are zero, allowing blocks to run and
store cleared dK/dV when `seqlen_q == 0`.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-03-27 15:54:01 +08:00
joyeamd
c5202aada0 [CK][CK_TILE] Revert addional oob check in gemm IsSupported function (#5789)
## Motivation

fix ck_tile's oob check.



## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-26 09:40:44 +08:00
Ville Pietilä
4fa541e262 [CK_TILE, CK_BUILDER] Add bwd data to CK Tile profiler (#5516)
## Motivation

We want close the performance gap between old CK and CK Tile for bwd
data convolutions. To achieve this, we need tow things

- Configurations for the old CK kernel instances such that we can map
them into CK Tile instances.
- Support in CK profiler to run the CK Tile instance with the same API
as for old CK instances.

## Technical Details

Extracted kernel configurations from old CK. The codegen python script
for CK Tile convs is extended to support also bwd data. The generated
instances are added to the CMake build (target
`device_grouped_conv_bwd_data_tile_instances`).
A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the
CK Profiler. The API is same as for old CK's profiler op
`grouped_conv_bwd_data`.

---------

Co-authored-by: Ville Pietilä <>
2026-03-25 14:34:13 +00:00
joyeamd
393ce1d23f Revert "Ck/joye/revert oob check (#5640)" (#5697)
This reverts commit 552ab4880292694cb8261f40fa4223af52cb8419.

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-23 22:04:14 +00:00
Bartłomiej Kocot
ce4525e82b [CK][CK Tile] Fix kbatch check in grouped conv and gemm kernels (#5555)
## Motivation

Fix kbatch check in grouped conv and gemm kernels, allow tails for
kbatch.

## Technical Details

Round up K / Kperxdl and divide it by Kbatch to allow tail for K.

## Test Plan

test_grouped_convnd_bwd_weight_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-21 23:55:24 +01:00
Bartłomiej Kocot
5db7b220d5 [CK][CK Tile] Improve access for merged groups and remove modulo from xor (#5334)
## Motivation

[CK][CK Tile] Improve access for merged groups and remove modulo from
xor

## Technical Details

- add template parameter to xor if modulo is needed. We don't need
modulo for merged groups
- use access by m for merged groups for a tensor
- 
## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 15:45:45 +00:00
Kiefer van Teutem
e8f9bb0c19 [CK_Tile] Refactor amdgcn_mma policy structs (#5272)
## Motivation
The point of this MR is to update the intrinsic layout parameters to
simplify them and make them more clear and flexible. Also, a number of
simple refactors were performed to reduce boilerplate and code
duplication.

## Technical Details
In CK Tile and old CK, the full set of information available in the
intrinsic wrappers, for WMMA and MFMA combined, would be something like:

```
// Basic info
using ADataType = void;
using BDataType = void;
using CDataType = void;

using AVecType = ext_vector_t<ADataType, 0>;
using BVecType = ext_vector_t<BDataType, 0>;
using CVecType = ext_vector_t<CDataType, 0>;

// Fragment sizes
static constexpr index_t kM;
static constexpr index_t kN;
static constexpr index_t kK;

// Layout parameters
static constexpr index_t kAMBlock;
static constexpr index_t kBNBlock;

static constexpr index_t kRepeat;
static constexpr index_t kAMLane;
static constexpr index_t kBNLane;
static constexpr index_t kABK0PerLane;
static constexpr index_t kABKLane;
static constexpr index_t kABK1PerLane;

static constexpr index_t kCMLane;
static constexpr index_t kCNLane;
static constexpr index_t kCM0PerLane;
static constexpr index_t kCM1PerLane;

using kABPs2RHssMajor = sequence<2, 1>;
using kABPs2RHssMinor = sequence<1, 0>;
using kABYs2RHsMajor  = sequence<2, 2>;
using kABYs2RHsMinor  = sequence<0, 2>;

using kCPs2RHssMajor = sequence<1, 2>;
using kCPs2RHssMinor = sequence<1, 0>;
using kCYs2RHsMajor  = sequence<1, 1>;
using kCYs2RHsMinor  = sequence<0, 2>;

using kCTPs2RHssMajor = sequence<2, 1>;
using kCTPs2RHssMinor = sequence<1, 0>;
using kCTYs2RHsMajor  = sequence<2, 2>;
using kCTYs2RHsMinor  = sequence<0, 2>;   
 ```
Note that on top of the intrinsic sizes, we have 12 layout parameters. I have reduced this in the new design to:

```
// Basic info
using ADataType = void;
using BDataType = void;
using CDataType = void;

// Fragment sizes
static constexpr index_t kM;
static constexpr index_t kN;
static constexpr index_t kK;

// Layout parameters
static constexpr index_t kABKPerLane; // K2 * K0, Always the same, even
for diff A / B layouts
static constexpr index_t kAKNumAccess; // K2
static constexpr index_t kARepeat; // Used for RDNA3 repeated inputs and
CDNA block hiding.
static constexpr index_t kBKNumAccess; // K2
static constexpr index_t kBRepeat; // Used for RDNA3 repeated inputs and
CDNA block hiding.
static constexpr index_t kCMPerLane;   // M2 * M0
static constexpr index_t kCMNumAccess; // M2

// Derived properties
using AVecType = ext_vector_t<ADataType, 0>;
using BVecType = ext_vector_t<BDataType, 0>;
using CVecType = ext_vector_t<CDataType, 0>;
```

Note that there are now only 7 layout parameters and no more dimensionality orderings. Believe it or not these 7 parameters are more general than the original 12, and can handle intrinsic and mid-level features that are currently awkward in CK Tile, like dealing with AttrNumAccess, different A / B layouts, more general block-hiding (currently very limited in CK tile), and future arch features.

Furthermore, the A, B and C vec types are now derived directly from the layout parameters to ensure internal consistency.

I added a detailed explanation of the new params in terms of register mappings at the top of amgcn_mma.hpp

Other refactorings I did in this MR:

- Make an amdgcn_mma_base struct to drastically reduce code duplication and potential bugs. Should also make auto-generating the amd_gcn specializations much easier.
- Simplify the MmaOpTraits significantly by only including those parameters that are not directly gettable from the MmaOp itself. This removes duplicated variables and simplifies higher level code.
- Remove overloaded "Block" term for intrinsic dimensions, and replace by "Frag" instead. Some spots were already using the term "Frag" for combined intrinsics, in which case I changed that term to "Chunk" instead.
- Remove some tests that had become somewhat pointless (setting variables and then checking their values immediately).

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 09:07:00 -06:00
joyeamd
d2b129f56b Ck/joye/revert oob check (#5640)
## Motivation

fix ck_tile's oob check. 


## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 12:30:08 +00:00
arai713
121fa1a503 [CK_TILE] Rename Stream-K grid function (#4795)
## Motivation
This PR introduces a change in the name of the get_grid function in the
Stream-K TilePartitioner to avoid confusion with a similarly named
method. In the Stream-K TilePartitioner, there is get_grid() which
returns num_cu*occupancy and there is grid_size() which returns the grid
size used to launch the kernel. In this PR, we change get_grid() to be
get_max_active_wgs() to better reflect what the function returns and not
confuse it with grid_size().

## Technical Details
Initially in the Stream-K TilePartitioner we had get_grid() which
returned grid_. We are renaming get_grid() to get_max_active_wgs() and
grid_ to max_active_wgs_ internally, while keeping grid_size() the same.
The parameter, grid, for the Stream-K TilePartitioner remains the same
to maintain consistency with the rest of the Stream-K API.

## Test Plan
Validated using the test suite that is already present.

## Test Result
All tests passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 03:27:44 -06:00
Sami Remes
a699df9fdc [CK_TILE] Enable MXFP6 for MX GEMM op (#5095)
## Motivation

Add support for MXFP6 in the MX GEMM op in CK-Tile.

Depends on https://github.com/ROCm/rocm-libraries/pull/4594

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-19 18:07:47 -07:00
Bartłomiej Kocot
aa5ae1c005 [CK][CK Tile] Fix dram step for KM/KN layouts in V1 pipeline (#5470)
## Motivation

Fix v1 pipeline for KM/KN layouts by passing correct step for dram tile
window.

## Technical Details

- Fix dram step for KM/KN layouts in V1 pipeline
- Disable instances which use more threads than warp size in continous
dim (not supported in ck tile yet)
- Use 1x1 specialization for explicit gemm
- Use two stage for vectorsize =1 and sizeof(datatype) ==2
- remove not needed check sinze GetVectorSizeA/B check if vector size is
fixed

## Test Plan

test_grouped_convnd_bwd_weight_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-966
2026-03-19 11:59:44 +00:00
assistant-librarian[bot]
39bc8453c6 [CK_TILE] add tf32 support (#4302)
## Proposed changes

TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in
CK_TILE on gfx942 and gfx950.

## Checklist

Please put an into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run  on all changed files
- [ ] Any dependent changes have been merged

## Discussion



---
🔁 Imported from
[ROCm/composable_kernel#3538](https://github.com/ROCm/composable_kernel/pull/3538)
🧑‍💻 Originally authored by @yingluAMD

---------

Co-authored-by: yingluAMD <Yingmao.Lu@amd.com>
Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-03-19 10:17:20 +01:00
lalala-sh
b0df41c187 [CK] Fix MOE FP8 SplitK buffer descriptor OOB (#5086)
When SplitK is enabled, kernel entry shifts A/B/AScale/BScale base
pointers by SplitKBatchOffset, but make_dynamic_buffer element spaces
are still based on full K dimension. This causes hardware buffer
resource descriptors to extend beyond the actual tensor allocation,
leading to GPU memory access faults when the tensor happens to be placed
at the end of an allocated memory pool region.

Fix by subtracting the split offset from each buffer's element space in
both Run() (v1 pipeline) and Run_2Lds() (v2/v3 pipeline), so the buffer
descriptor range [shifted_base, shifted_base + reduced_space) exactly
covers the valid allocation.

Also refactor SplitKBatchOffset to accept const Problem& (instead of
Argument&) and add a default constructor, enabling direct reuse in
Run/Run_2Lds without duplicating offset calculation logic.

Made-with: Cursor

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Yi DING <yi.ding@amd.com>
2026-03-19 10:41:53 +08:00
Christopher Millette
4fdcfab53d [CK] Replace nested static_for with static_ford to reduce device IR function emissions [1B] (#5031)
## Summary

### Rationale
CK's GPU kernels are among the slowest files in the ROCm build, with a
single translation
unit taking up to 10+ minutes. Profiling with `-ftime-trace` identified
nested `static_for`
loops as the root cause: each nesting level multiplies the number of
unique lambda IR
functions the compiler must process. A 2-level nest of `static_for<0, M,
1>` /
`static_for<0, N, 1>` produces M×N unique lambda types. With typical
GEMM dimensions
(M=16, N=4), a single nest generates 64 unique functions — and these
nests appear hundreds
of times across the codebase.

The LLVM backend's CGSCC (Call Graph Strongly Connected Components)
framework processes
each function independently, so reducing function count directly reduces
backend time.

### What changed
393 nested compile-time loop patterns across 73 files are converted to
`static_ford`, which
flattens multi-dimensional compile-time iteration into a single
`static_for` with index
decomposition. This eliminates 994 `static_for` nesting levels (42%
reduction).

Three pattern categories were converted:
- **Category A**: `static_for` wrapping `static_ford` — fold outer
dimension into ford
- **Category B**: nested `static_ford` — merge into single
higher-dimensional ford
- **Category C**: nested `static_for` chains — convert to single
`static_ford`

### Verification

**ASM equivalence: PASS — 51/51 device assembly files identical (gfx942
+ gfx1100)**

| Architecture | Files compared | Largest file | Result |
|---|---|---|---|
| gfx942 | 36 | 386,685 lines | ALL MATCH |
| gfx1100 | 15 | 47,769 lines | ALL MATCH |

**Build time (Wilcoxon signed-rank test, 7 paired trials):**

| Target | Pre (s) | Post (s) | Delta | p-value |
|---|---|---|---|---|
| bscale | 169 | 152 | **-9.8%** | 0.016 \* |
| xdl_v1234 | 207 | 194 | **-6.6%** | 0.016 \* |
| preshuffle | 275 | 264 | **-3.9%** | 0.016 \* |
| xdl_base | 142 | 137 | **-3.2%** | 0.031 \* |

**IR function counts (device backend, gfx942):**

| Target | InstFunc Δ | CodeGen Δ | Compiler Δ |
|---|---|---|---|
| bscale | -13,043 (-8.2%) | -2,103 (-3.5%) | -10.7% |
| xdl_v1234 | -9,431 (-5.7%) | +59 (+0.1%) | -5.2% |
| xdl_base | -6,162 (-4.9%) | -1,141 (-2.5%) | -2.2% |
| xdl_old | -3,234 (-3.7%) | -963 (-8.7%) | -3.3% |

### Value
- **994 fewer `static_for` nesting levels** (-42%) across 73 files
- **393 `static_ford` sites** created (from 4 pre-existing)
- **Up to 9.8% compile-time reduction** on representative targets
(statistically significant, p < 0.05)
- **Up to 13K fewer IR function instantiations** per translation unit
- Net -849 LOC from reduced indentation
- **Zero ASM changes** — identical device code output verified on gfx942
and gfx1100
- All scheduling barriers, `if constexpr` guards, and MFMA/WMMA
accumulation order preserved

### Files changed (73)
- `block/`: 47 files (GEMM pipelines — xdlops, wmma, moe, preshuffle,
blockscale variants)
- `grid/`: 20 files (softmax, normalization, reduction, attention,
layernorm)
- `thread/`: 5 files (tensor slice transfer, contraction, GEMM dlops,
reduction)
- `tensor_description/`: 1 file (tensor_adaptor)

## Test plan
- [x] `static_ford` tested with 21 unit tests in
`test/util/unit_ford.cpp`
  (1D-4D, custom orders, compile-time verification)
- [x] All conversions preserve iteration order, `block_sync_lds()`
placement,
  `if constexpr` scheduling guards, and MFMA/WMMA accumulation order
- [x] ASM equivalence verified: 51 device `.s` files across gfx942 +
gfx1100
- [x] Build-time improvement statistically confirmed (Wilcoxon, p <
0.05, 4 targets)
- [x] IR function count reduction confirmed via `-ftime-trace` on 7
targets
- [x] Detection script reports 0 remaining safe patterns (180 blocked
with structural reasons)
- [x] Existing CI tests (GEMM, softmax, normalization, batch norm,
reduction,
  attention) exercise all converted code paths

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 08:45:22 -06:00
Thomas Ning
75357722b8 CK Tile MX GEMM Packing Improvement (#5323)
## Motivation

Reduce the scale loading size and also has better utilization of MFMA
scale selection.

## Technical Details

Add up the packing of mx scales.

## Test Plan

Use the existing test cases.

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
2026-03-17 11:57:32 -07:00
Hosang
b5894b3cbe [CK_TILE] Add LLC-aware FMHA head grouping and head-major scheduling on RDNA (#5018)
## Motivation
Long-sequence FMHA can become memory-bound when K/V working sets exceed
Infinity Cache (LLC), causing repeated DRAM traffic across heads.

This PR introduces LLC-aware launch ordering improvements for FMHA
forward, and it is currently enabled only on gfx11 and gfx12. The
approach is inspired by
[`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217),
adapted to CK’s kernel/runner structure and layout handling.

In this context, `bshd` is the layout used in Flash-Attention, while
`bhsd` is the default layout used by the CK Tile FMHA example.

## Technical Details
This PR adds two complementary strategies:

- For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware
head grouping:
  - Estimate LLC size (env override, KFD sysfs, or arch default).
  - Compute group size from K/V bytes per head vs LLC target.
- Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and
related tensors).

- For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit
launch-order adjustment:
  - Keep a single kernel launch.
- Reinterpret block linearization in `GetTileIndex` to make execution
head-major,
     improving temporal locality of per-head K/V reuse.

Additional integration updates:
- Propagate `num_head_q_total` and `head_start` through FMHA args/kargs.
- Use global head indexing for dropout RNG stream mapping so grouped
launches keep
    deterministic/consistent dropout behavior.
- Keep fallback behavior unchanged when grouping is not beneficial or
disabled.

## Test Plan
- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`

## Test Result
- `test_ck_tile_fmha`: all tests passed.
- `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201,
and all of them show higher performance compared to the baseline. The
improvement is consistent, and performance is well maintained even at
long sequence lengths.

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128
-s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
- TFLOPs by sequence length target: gfx1100 layout: bhsd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 56.27 | 61.48 | 1.09x
4096 | 67.10 | 72.27 | 1.08x
8192 | 65.99 | 71.64 | 1.09x
12288 | 61.60 | 76.61 | 1.24x
16384 | 58.99 | 75.74 | 1.28x
20480 | 57.32 | 74.42 | 1.30x
24576 | 56.89 | 74.25 | 1.31x
27280 | 18.93 | 24.48 | 1.29x

- TFLOPs by sequence length target: gfx1201 layout: bshd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 66.79 | 65.90 | 0.99x
4096 | 85.90 | 86.80 | 1.01x
8192 | 77.06 | 90.29 | 1.17x
12288 | 58.36 | 88.98 | 1.52x
16384 | 52.12 | 88.88 | 1.71x
20480 | 48.11 | 88.42 | 1.84x
24576 | 47.12 | 89.07 | 1.89x
27280 | 49.05 | 50.31 | 1.03x

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 21:18:34 +00:00
Bartłomiej Kocot
1e1f3647f7 [CK][CK Tile] Grouped Convolution backward weight profiler flush cache (#5454)
## Motivation

Flush cache to get more stable results during profiling old ck and ck
tile.

## Technical Details

Flush cache before each kernel call and one more first run.

## Test Plan

test_grouped_conv_bwd_weight_tile

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-966

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-16 17:46:21 +00:00
lalala-sh
4f091cacd0 [CK] fix moe memset size which is bigger than alloc (#5225)
## Motivation
Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that
crashes split-K MOE GEMM with "HIP runtime error: invalid argument".
When KBatch > 1, the invoker zeroes the output buffer using arg.M *
arg.N as the byte count. However, arg.M is the padded sorted-token-id
length from MOE routing, which can be much larger than the actual output
allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write
beyond the buffer, and the silently-swallowed HIP error propagates to
the subsequent kernel launch via hipGetLastError().
This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset
matches the actual output size.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 17:30:07 +08:00
Enrico Degregori
a9a8a57786 [CK Tile] Eight Waves pipeline GEMM (#4964)
## Motivation

Eight waves pipeline was added for ABQuant. The goal of this PR is to
enable it also for GEMM

## Technical Details

Summary:
 - Block:
- Create block struct for GEMM using eight warps specific distribution
encodings
   - Use this block struct in ABQuant for encodings
 - Pipeline:
- Create impl pipeline for eight waves which can be used by GEMM and
ABQuant as base (and for AQuant and BQuant in the future)
- Create eight waves pipeline for GEMM (this can not be easily
integrated in the existing async pipeline)
 - Pipeline policy:
- Extract GEMM specific parts in the ABQuant policy to define GEMM
policy (then ABQuant use it as base and add Quant specific methods)
- Minor: naming was inconsistent between warp/wave, everything is now
referred to as eight waves

So overall we have:
- block struct directly used by GEMM -> ABQuant derived struct to
implement operator
- Impl base pipeline with general implementation -> GEMM and ABQuant
pipelines use it to avoid code duplication but still define their own
pipelines
- pipeline policy struct directly used by GEMM -> ABQuant derived policy
struct for Quant specific parts

## Test Plan

Added new tests for GEMM pipeline:
`test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950
supports it).

Note: K padding test is disabled for this pipeline because it's not
implemented yet

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 09:30:54 +01:00
Yi DING
f6bfcad437 [CK_TILE] FMHA BWD Use Persistent Kernels in Deterministic Mode (#5174)
## Motivation
This PR enables a persistent-kernel execution path for FMHA backward
(dQ/dK/dV) in deterministic mode, adjusting how dQ accumulation is
split, stored, and converted back to final gradients.

## Technical Details
- Introduces a persistent-kernel grid mapping in deterministic mode and
updates split-count calculation accordingly.
- Extends kernel kargs to carry batch-related info needed for persistent
scheduling and dQ conversion.
- Refactors dQ store conditions and adds mask-type traits/utilities and
runner logging updates.

## Test Plan
- Jenkins
[base](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/10/pipeline)
- Jenkins
[AITER](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/12/pipeline)
- Jenkins
[FMHA](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/11/pipeline)
- local FA tests

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-13 14:13:32 +08:00
Ville Pietilä
5d2cbd1117 [CK_TILE, CK_BUILDER] Add two-stage bwd weight kernels to CK Tile profiler (#5237)
## Motivation

PR #4797 added CK Tile bwd weight kernels to the CK Profiler. The
two-stage kernels were not supported in the initial PR. This PR adds the
the missing bwd weight two-stage kernels to the CK Profiler.

## Technical Details

Extended the CK Tile conv builder factory to build also the elementwise
ops required for the two-stage kernels. Extended the CK Builder for CK
Tile instance to accept the two-stage flag as part of the algorithm
configuration.

## Test Plan

Added units tests for CK Builder that verify the two-stage kernel
construction.
  
## Test Result

If CI passes, the added unit tests are passing.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Ville Pietilä <>
2026-03-12 19:20:15 -06:00
Adam Osewski
7a12ecc762 [CK TILE] Skip work if any of Grouped GEMM groups M/N/K are zero. (#5050)
## Motivation

It's common in MoE workloads that some experts receive zero tokens,
which would result in some of the dimensions equal to zero. Currently we
handle such case only for non-persistent kernels where we have all GEMMs
information beforehand on host - we validate this during creation of
kernel arguments. However for the "dynamic" input path (persistent
kernel) this information is not available before kernel launch. Thus we
have to validate this during kernel execution. The goal is to add this
validation.

## Technical Details

Skip work if any of Grouped GEMM groups M/N/K are zero for persistent
kernel path.

## Test Plan

Add unit-tests which cover "dynamic" inputs with zero dims for
persistent kernel execution path.

## Test Result

All tests pass.

## Submission Checklist

- [ x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 13:28:24 +00:00
chris-tsiaousis-hpc
9c1c4c9168 Changed the include order of the new WMMA/MFMA unification framework (#5241)
Those changes are to fix the include order and make header files
independent of one another. Also the `remod.py` sript has run and
changed the `grouped_convolution.hpp` and `core.hpp` files.

## Motivation

Some headers appear to depend on include order.
For example, when moving `#include "wmma/wmma.hpp"` in
[amdgcn_mma.hpp](https://github.com/ROCm/rocm-libraries/blob/develop/projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp)
later in the include list, it is causing compilation errors. Also the
pre-commit script `remod.py` is shuffling includes to be in alphabetical
order and is causing compilation issues.

Expected behaviour:
Headers should be independent of one another: no header should require
another to be included first. Each header should compile correctly on
its own.

## Test Plan

The CI (that runs `remod.py`) should compile.

## Test Result

Existing CI should compile and be green.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-03-12 09:26:58 +01:00
Bartłomiej Kocot
86ac3213ed [CK][CK Tile] Improvements for grouped conv fwd tile profiling (#5114)
## Motivation

Improve profiling for grouped convolution forward for better comparison
between CK and CK Tile
## Technical Details

- Include preprocessing time for ck tile
- Add flush cache for conv fwd profiler
- Switch configs to builder reflect
- Add KPerXdl deduce
- Add non-grouped ported instances

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-786
2026-03-11 23:38:15 +01:00