749 Commits

Author SHA1 Message Date
Po Yen Chen
5cab2c3565 [CK] Skip fp16 dropout d256 batch tests for compiler VGPR aliasing bug (#6342)
## Summary

- Skip fp16 FMHA forward dropout tests that use the d256 tile in batch
mode, gated on compiler version
- The AMDGPU compiler miscompiles these kernels due to VGPR aliasing of
Philox RNG parameters under high register pressure (383 VGPRs)
- bf16 dropout tests are unaffected and cover the same code paths

## Root Cause

The compiler aliases `ph_seed` and `ph_head_offset` (Philox RNG state
stored in VGPRs) with other live data during the softmax main loop. This
causes corrupted `buffer_store_byte` writes for dropout randval on wave
lanes 32-63, producing NaN in output and LSE tensors.

**Conditions:** fp16 + d256 tile + dropout + batch mode + `qr` pipeline
+ gfx90a

## Changes

- `include/ck_tile/core/config.hpp`: Add
`CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE` macro
- `test/ck_tile/fmha/test_fmha_fwd.cpp`: Version-gated `GTEST_SKIP` in
`TEST_P(Dropout, ...)`

## Test plan

- [x] ROCm 7.1.1 (clang 20): 168/168 fp16 dropout tests PASS (no skip
active)
- [x] ROCm 7.12 (clang 22): 132 PASS, 36 SKIPPED, 0 FAILED
- [x] bf16 dropout tests: 168/168 PASS (unaffected by this change)
2026-04-14 14:07:20 +00:00
Yaswanth Raparti
f603d75f20 [CK][CK TILE] Modify elementwise kernel template signature to accept independent type arguments (#6399)
## Motivation

Modify the elementwise kernel template signature to fix a CShuffle
epilogue build error.

## Technical Details

A build error occurred while building the conv fallback kernel with the
dispatcher: a type mismatch in `ElementWiseKernel::operator()`. The
template required all three parameters (lens, input_strides,
output_strides) to be the same type, but the CShuffle epilogue was
passing them with different tuple element types.

Solution: modified the template signature in `elementwise_kernel.hpp` to
accept three independent type parameters:

- Changed from a single typename `Dims` to typename `DimsLens`, typename
`DimsInStrides`, typename `DimsOutStrides`
- Updated references to `Dims::size()` to use the appropriate specific
type
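
A minimal sketch of the signature change, assuming `std::tuple` stand-ins for the ck_tile tuple types. `launch_elementwise` and `demo_rank` are hypothetical names for illustration, not the actual `ElementWiseKernel::operator()`:

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Hypothetical stand-in for the kernel entry point. With a single
// `typename Dims`, all three arguments had to deduce to one type.
template <typename DimsLens, typename DimsInStrides, typename DimsOutStrides>
constexpr std::size_t launch_elementwise(const DimsLens&,
                                         const DimsInStrides&,
                                         const DimsOutStrides&)
{
    // Element types may differ per argument, but the ranks must agree.
    static_assert(std::tuple_size_v<DimsLens> == std::tuple_size_v<DimsInStrides>,
                  "lens and input strides must have the same rank");
    static_assert(std::tuple_size_v<DimsLens> == std::tuple_size_v<DimsOutStrides>,
                  "lens and output strides must have the same rank");
    return std::tuple_size_v<DimsLens>;
}

// Call with three different tuple types; a single shared `Dims`
// parameter could not be deduced from these arguments.
inline std::size_t demo_rank()
{
    auto lens        = std::make_tuple(4, 8);   // tuple<int, int>
    auto in_strides  = std::make_tuple(8L, 1L); // tuple<long, long>
    auto out_strides = std::make_tuple(8, 1L);  // tuple<int, long>
    return launch_elementwise(lens, in_strides, out_strides);
}
```

Each size query then goes through the type of the argument it refers to, which is the role `Dims::size()` played before the split.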

## Test Plan

- Test with dispatcher conv unit tests
- Relying on CI tests

## Test Result
- Dispatcher unit tests passed
- Relying on CI tests

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-14 01:44:27 -06:00
chris-tsiaousis-hpc
e4bdc5d91a [CK Tile] Unification work - mma transformations pipeline (#5508)
## Motivation

In this PR we showcase how the amdgcn structs could be used in a pipeline that does some extra pre/post processing.
For the sparse intrinsics, we have so far compressed the A vector "on the fly" right before the execution of the builtin. This might introduce performance issues down the line if, for example, the user decides to chain multiple sparse builtins. We tackle this problem by creating a specific SparseCompressTransform.

A MmaPipelineBase is also created to facilitate these kinds of higher-level compositions of the amdgcn structs and is integrated into the existing WaveWiseMma prototype. The MmaPipelineOptionFlags are intended to facilitate future operations, like swizzle A/B, C transpose or double/quad attr num access, but those are not yet defined and should be added in a future PR.
The pipeline base class is basically at the RFC stage.

We also create a runtime test for the existing WaveWiseMma, as well as one for the SparseMma pipeline.

## Technical Details

The goal should be to have the pipeline easily expandable. Should the CRTP of the base class or the interface in general prove insufficient or unable to handle all of our needs, a design modification should be discussed.

## Test Plan

New tests are added.

## Test Result

Tests should pass.

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-04-14 09:25:01 +02:00
Brock Hargreaves
e939299c69 [MIOPEN] [CK] Revert "[CK] Disable test cases affected by compiler codegen bugs on gfx90a" (#6400)
Reverts ROCm/rocm-libraries#6343

This is causing failures in MIOpen, namely Dbsync gfx942, even though it shouldn't be affected, so this needs to be investigated. Please add miopen as a label to the new PR that addresses the compiler codegen bug so that both can be handled simultaneously.
2026-04-13 20:46:07 -06:00
Ville Pietilä
1b2a619107 [CK] Disable compilation of problematic bwd weight conv instances for gfx90a (#6343)
## Motivation

Due to compiler version update, there are test failures in the test
suite `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current `develop` branch, the test
failures are blocking any PR merges into `develop`. An example of a
failed CI run is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as tests are passing for clang compiler version 20.0 and failing
for clang compiler version 22.0.

## Technical Details

This PR disables the compilation of the problematic bwd weight conv
instance for `gfx90a` by adding a new CMake flag `CK_USE_GFX90A` that
allows us to detect when we are compiling for `gfx90a`. Using the new
CMake flag, compilation of instance
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>` is disabled for `gfx90a`.

Co-authored-by: Ville Pietilä <>
2026-04-13 13:40:27 +02:00
Kiefer van Teutem
9fe98c864f [CK Tile] Add Tile Distribution Encoding Calculator (#5515)
## Motivation

We want to be able to calculate TileDistributionEncodings describing
register mappings for any MmaOp. This is necessary for further
integration with CK Tile.

This MR adds a new struct TileDistrEncCalc, which takes an amdgcn_mma
type (MmaOp) and provides ABC warp distribution encodings for mapping
matrix fragment coordinates to register coordinates
(lane, vector item) and vice versa. It is able to take CTranspose,
Swizzle, and NumAccessA / NumAccessB template parameters for tweaking
the tile distributions. Swizzle modification will be implemented later.

The current implementation can deal with all intrinsic types and
block-hiding.

This MR also adds some additional static asserts and derived params
within amdgcn_mma_base, to enforce consistency and help calculate Tile
Distributions for block-hiding intrinsics.

An Example was added that uses the Tile Distr Enc Calc to calc and print
register layouts for Tile Distributions for some of our amdgcn_mma
structs. It also makes sure that the CTranspose modifier works as
intended.

Some additional gfx9 intrinsics were added to test block-hiding layouts
for the different types of C-block-hiding layouts.

The sparse intrinsic wrappers were updated according to Chris's recent
changes in another branch
(https://github.com/ROCm/rocm-libraries/pull/5508), which moved the
compression step outside of the intrinsic itself. This is necessary to
make sure that the Calculator can deal with this new interpretation of
the sparse intrinsics. I directly copied the new amdgcn structs from
Chris's branch and changed nothing else to avoid more complex merges in
the future. Note that this means I did not update a bunch of related
sparse code since that would be a lot, and therefore I disabled
test_amdgcn_sparse_mma for now.

The amdgcn_mma_layout test was refactored a bit:
- The old register mapping utility was removed and its use was replaced
by the new TileDistrEncCalc
- More tests were added to test layouts for different types of
block-hiding and sparse intrinsics
- The Selector method was removed and the tests were split up over
target architectures, with each target arch having a direct list of
amdgcn structs to be tested. This ensures that we force specific tests
on specific architectures and makes sure that the selector doesn't
quietly do some workarounds like creating compound intrinsics.

## Test Results

Layout tests based on calculated tile distribution encodings pass on all
architectures. Calculator works for all currently added amdgcn structs,
which includes different types of block-hiding and sparse intrinsics.
Printed layouts from new example verified by eye. CTranspose modifier
tested for large set of intrinsics.
2026-04-13 08:00:31 +00:00
Aviral Goel
2ff7ac5abc CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302)
Depends on #6300 

## Summary

Remove 41 commented-out code blocks across 33 files in Composable
Kernel, totaling ~200 lines.

Identified using an automated dead code scanning skill (`ck-dead-code`)
with a calibrated two-stage pipeline:
1. **Pre-filter**: Keyword-based scan found 1,338 `//`-commented blocks.
Calibrated heuristics (trained on 50-sample expert classification)
reduced to 89 high-confidence candidates — 93% noise reduction.
2. **Expert triage**: LLM expert classified each block in context as
CODE_REMOVE, CODE_KEEP, or NOT_CODE.

| Classification | Count |
|---------------|-------|
| Removed (this PR) | 41 |
| Kept (debug helpers, alt configs, reference impls) | 32 |
| Not code (false positives) | 16 |

Removed blocks include: superseded implementations, old test data,
abandoned stubs, unreachable code, and buggy dead code.
2026-04-10 11:17:11 -04:00
Hosang Yoon
fb22cd0c69 [CK_TILE] Optimize FMHA head-dim padded path on gfx11/gfx12 (#6156)
## Motivation
On gfx11/gfx12, FMHA forward kernels that require head-dim padding show
a large performance drop compared to the exact-head-dim path. In
practice, padded cases such as `HDIM=72` and `HDIM=80` were falling too
far off the fast path.

This PR improves padded-head-dim FMHA performance on gfx11/gfx12 while
keeping the behavior for other GPUs unchanged.

## Technical Details

- Add/scope a dedicated padded-head-dim (`qr_hpad`) FMHA forward path
for gfx11/gfx12.
- For `receipt=0`, keep support conservative and only enable the padded
fast path for vector-safe cases (`head_dim % 8 == 0`), matching the
existing assumption used on other GPUs.
- Move `v_prefetch` later only for the head-dim-padded path on
gfx11/gfx12. This reduces live ranges and removes the register-spill
behavior seen in the earlier scheduling.
- Enable the buffer-load OOB check offset trick for the padded path on
gfx11/gfx12.

## Test Plan

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

## Test Result

Observed padded-head-dim performance improvements for HDIM=72/80:

- gfx11: about 3.5x
- gfx1151: about 2.0x
- gfx12: about 1.3x


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-08 10:51:53 -04:00
Christopher Millette
7816812ef8 [CK_TILE] Flatten nested static_for loops into static_ford (#5939)
## Summary
Mechanical conversion of 129 nested `static_for`/`static_ford` patterns
to flat `static_ford` across 29 ck_tile header files.

Each conversion eliminates intermediate lambda closure instantiations by
replacing nested compile-time loops with a single flat iteration using
index decomposition.

### What `static_ford` eliminates

When `static_for` loops are nested, each level creates unique closure
types:
```cpp
// BEFORE: M + M×N = 20 IR functions (for M=4, N=4)
static_for<0, 4, 1>{}([&](auto m) {        // 4 closure instantiations
    static_for<0, 4, 1>{}([&](auto n) {     // 4×4 = 16 closure instantiations
        body(m, n);
    });
});

// AFTER: M×N = 16 IR functions (with ford_applier, no intermediates)
static_ford<sequence<4, 4>>{}([&](auto mn) {
    constexpr auto m = number<mn[number<0>{}]>{};
    constexpr auto n = number<mn[number<1>{}]>{};
    body(m, n);
});
```

### Pattern categories converted

| Category | Count | Description |
|----------|-------|-------------|
| C (2-level `static_for` chains) | 112 | Nested `static_for` → `static_ford` |
| C3 (3-level `static_for` chains) | 9 | Three consecutive nests → `static_ford` |
| Partial rescue | 3 | Outer 2 levels of blocked 4-level nests |
| B (nested `static_ford` merge) | 5 | Two nested `static_ford` → single higher-dim `static_ford` |
| **Total** | **129** | Across 29 files |

6 false positives were detected and reverted (in `tensor_adaptor.hpp`,
`tile_distribution.hpp`, `tile_distribution_encoding.hpp`) where the
inner loop bound depended on the outer variable.

### Files changed by family

| Family | Files | Sites |
|--------|-------|-------|
| Block GEMM | 12 | ~20 |
| FlatMM pipelines | 4 | ~69 (including 5 ford-ford merges) |
| GEMM quant | 7 | ~22 |
| FlatMM kernel | 1 | 2 |
| FMHA | 1 | 2 |
| Reduce/norm | 2 | 2 |
| Epilogue | 1 | 1 |

### Blocked locations from review comments

- **block_gemm_areg_breg_creg_v1.hpp:356** — BLOCKED: runtime scale
loads (`scale_a_slice`, `scale_b_slice`, A warp tensor load) between
every nesting level
- **block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp:228** — BLOCKED:
`zero_accumulators()` before inner loop; `sched_barrier` + conditional
`block_sync_lds()` after inner loop
- **block_universal_gemm_as_aquant_bs_bquant_cr.hpp:298** — BLOCKED:
runtime `CWarpTensor` construction before inner loop; quantization scale
application code after inner loop
- **block_universal_gemm_as_aquant_bs_cr.hpp:277** — BLOCKED: same
pattern as above
- **block_universal_gemm_as_bs_bquant_cr.hpp:367** — BLOCKED: same
pattern as above

## Depends on
- #5938 ([CK_TILE] Optimize static_ford and sequence compile-time
infrastructure) — provides the `ford_applier` that makes these
conversions beneficial. Without it, `static_ford` uses a recursive
implementation that provides no IR function savings.

## Results (combined with #5938)

### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942)

| Target | Base (s) | Treat (s) | Delta | % | Significant? |
|--------|----------|-----------|-------|---|-------------|
| **flatmm** | 161.1 | 149.0 | **-12.1s** | **-7.5%** | **YES** (p<0.01, 7/7 wins) |
| **universal_gemm** | 225.4 | 220.3 | **-5.1s** | **-2.3%** | **YES** (p<0.01, 7/7 wins) |

### IR Function Counts (device trace, gfx942)

| Target | InstFunc | CodeGen |
|--------|----------|---------|
| universal_gemm | **-8.5%** | **-9.2%** |
| flatmm | **-7.6%** | **-10.5%** |

### ASM Equivalence
5/5 PASS — 650,151 lines verified identical (gfx942). TUs:
universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale.

## Test plan
- [x] ASM equivalence verified (650K lines, gfx942)
- [x] Wilcoxon timing verified (7 trials, p<0.01)
- [x] IR function counts verified (-7.6% to -10.5% CodeGen reduction)
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2026-04-07 08:36:45 -06:00
Po Yen Chen
341fb33386 [CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051)
## Motivation

The existing FMHA V3 pipeline only supports fp16/bf16 data types. This
PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950,
enabling higher throughput for FP8 inference workloads using the
assembly-optimized V3 code path.

## Technical Details

  **Warp GEMM:**
- Add FP8 32x32x32 warp gemm with C-transposed distribution
(`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries

  **V3 Kernel (`fmha_fwd_v3_kernel.hpp`):**
- Add per-tensor descale support for Q, K, V tensors, passing descale
pointers through to pipeline kargs

  **V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`):**
  - Add FP8 data path with dtype-aware type selection
  - Add asm volatile P matrix conversion from f32 to fp8
  - Add FP8-aware instruction scheduling in `CoreLoopScheduler`

**V3 Pipeline Policy
(`block_fmha_fwd_v3_pipeline_default_policy.hpp`):**
- Add FP8 QK warp gemm selection (SwizzleB variant for V tile
distribution compatibility)

  **Codegen (`fmha_fwd.py`):**
  - Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128)
- Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale:
no/pertensor)
  - Extend `can_dispatch_v3` condition for fp8bf16 + pertensor

  **Misc:**
- Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum
(`arch.hpp`)
- Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`)

V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`)
pending further validation.

## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950)

  | Problem | Regular (TFlops) | Varlen (TFlops) |
  |---|---:|---:|
  | batch=1 heads=6/1 seqlen=1024 causal | 48.9 | 47.6 |
  | batch=1 heads=6/1 seqlen=2048 causal | 119.8 | 117.4 |
  | batch=1 heads=6/1 seqlen=4096 causal | 263.7 | 259.2 |
  | batch=1 heads=6/1 seqlen=8192 causal | 548.9 | 543.6 |
  | batch=1 heads=6/1 seqlen=16384 causal | 1043.0 | 1063.7 |
  | batch=1 heads=6/1 seqlen=32768 causal | 1237.2 | 1279.6 |
  | batch=1 heads=6/1 seqlen=65536 causal | 1315.4 | 1382.7 |
  | batch=1 heads=6/1 seqlen=131072 causal | 1326.3 | 1402.2 |
  | batch=1 heads=16/1 seqlen=65536 causal | 1298.7 | 1388.4 |
  | batch=1 heads=40/40 seqlen=37200 non-causal | 1248.9 | 1326.1 |

## Test Plan

Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering
batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40),
GQA ratios (1:1, 1:8), and causal/non-causal modes. Verified all cases
dispatch to the V3 pipeline by enabling `F_is_v3_enabled` and confirming
kernel names contain `qr_async_trload_v3`.

## Test Result

176/176 tests passed with V3 enabled. All cases correctly dispatched to
V3 pipeline with `pertensor` quantization.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-07 22:19:28 +08:00
Jeff Huang
8a29683326 [CK_TILE] Temporarily remove batch prefill KV cache overflow asserts (#6201)
## Summary
- Temporarily remove the KV cache offset overflow assert checks in
`FmhaBatchPrefillWithPagedKVCacheKernel`
- The asserts are **correct**, but they block project progress in
certain configurations
- This is a **temporary workaround** to unblock progress; a proper fix
will follow

## Note
This is NOT a permanent solution. A follow-up PR will add proper
overflow handling that addresses the underlying issue without blocking
progress.
2026-04-07 20:41:24 +08:00
Bartłomiej Kocot
a9b3eaffb8 [CK][CK Tile] Conv Bwd Data flush cache and profiling improvements (#6090)
## Motivation

Improve accuracy of conv bwd data perf measurements

## Technical Details
- Enable flush cache.
- For grouped conv, we zero the conv input (GEMM output) inside the device
op, so zeroing is already included in the time measurement.
- For non-grouped conv, we zero the conv input (GEMM output) outside the
device op (in profile_conv_bwd_data_impl.hpp), so it was not included.
- This PR includes the zeroing when time_kernel/flush_cache is enabled,
which should give a fairer comparison. The change is limited to
time_kernel/flush_cache because MIOpen runs its own zeroing for
non-grouped solvers.

## Test Plan

test_grouped_conv_bwd_data_*

## Test Result

CI pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-04 00:22:22 +00:00
harkgill-amd
e3f31255f3 Add missing gfx1033 to gfx103 group definition in ck (#5141)
## Motivation

Resolving PyTorch build failures when enabling builds for gfx103X-all
family in TheRock. https://github.com/ROCm/TheRock/pull/3763. `gfx1033`
is the only failing architecture in the family and the failures point to
missing support in CK.

## Technical Details

PyTorch build fails with repeated error message
```
/__w/TheRock/TheRock/external-builds/pytorch/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:33:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
   33 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
      |                                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
`gfx1033` is missing from the `__gfx103__` group, which results in
`CK_BUFFER_RESOURCE_3RD_DWORD` never being defined for it. Adding
`gfx1033` to the files where it is missing should be the minimum fix to
allow torch builds to pass.
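
The failure mode can be sketched as follows. This is an illustration under stated assumptions, not the CK headers: `SKETCH_GFX103` and `SKETCH_BUFFER_RESOURCE_3RD_DWORD` are hypothetical names, and the list of group members other than `gfx1033` is assumed. The two values are the ones checked in the test plan of this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical group macro standing in for `__gfx103__`. Leaving one
// target out of this list is the kind of omission the PR fixes.
#if defined(__gfx1030__) || defined(__gfx1031__) || defined(__gfx1032__) || \
    defined(__gfx1033__) || defined(__gfx1034__)
#define SKETCH_GFX103 1
#endif

#if !defined(__HIP_DEVICE_COMPILE__)
// Host pass: placeholder value.
#define SKETCH_BUFFER_RESOURCE_3RD_DWORD (-1)
#elif defined(SKETCH_GFX103)
// Device pass for a target inside the group.
#define SKETCH_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#endif
// A device target that belongs to no group leaves the macro undefined,
// so any use fails with "use of undeclared identifier".

inline std::int64_t third_dword() { return SKETCH_BUFFER_RESOURCE_3RD_DWORD; }
```

Compiled for the host, `third_dword()` returns the placeholder; the device value only appears when the target is listed in the group.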

## Test Plan

Compile sample test file and target gfx1033
```
...
#ifdef __HIP_DEVICE_COMPILE__
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == 0x31014000, "wrong device value");
#else
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == -1, "wrong host value");
#endif
```

## Test Result

Prior to applying the patch, compilation fails with `error: use of
undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'`

After applying patch, test file compiles successfully.

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-04-03 13:44:38 -06:00
Thrupti Raj Lakshmana Gowda
88310f86b9 [CK Tile] Fix architecture-dependent EightWave assignment in cshuffle_epilogue (#6102)
## Motivation
Tile engine CI build on the develop branch started failing after a
recent change (https://github.com/ROCm/rocm-libraries/pull/5218) in
`cshuffle_epilogue.hpp`. The `EightWave` constant was unconditionally
computed as `(MWave * NWave == 8)` for all architectures, but this logic
is only valid for gfx9*. On other architectures (e.g., gfx1201),
`EightWave` must always be `false`, otherwise it leads to incorrect
`BlockedXDLN_PerWarp` computation and build failures.

## Technical Details
In `cshuffle_epilogue.hpp`, the `EightWave` static constexpr was set as:
```cpp
static constexpr bool EightWave = (MWave * NWave == 8);
```
This was applied regardless of the target GPU architecture. The fix uses
a preprocessor guard to make this architecture-aware:
- **gfx9* (`__gfx9__`):** `EightWave` is evaluated as `(MWave * NWave ==
8)` — true or false depending on the wave configuration
- **All other architectures:** `EightWave` defaults to `false`
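
A reduced sketch of the guard described above, assuming `__gfx9__` is defined by the library's architecture grouping for gfx9 device passes. `EpilogueConfigSketch` is a hypothetical stand-in for the epilogue struct:

```cpp
#include <cassert>

// Hypothetical, reduced stand-in for the epilogue's wave configuration.
template <int MWave, int NWave>
struct EpilogueConfigSketch
{
#if defined(__gfx9__)
    // gfx9*: eight-wave mode depends on the wave layout.
    static constexpr bool EightWave = (MWave * NWave == 8);
#else
    // Every other target, and the host pass: never eight-wave.
    static constexpr bool EightWave = false;
#endif
};
```

A 2x2 wave layout yields `false` on every target; a 2x4 layout yields `true` only where `__gfx9__` is defined.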

## Test Plan
- Tile engine CI build on develop branch

## Test Result
- *Pending CI*

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---
> **Note:** This PR supersedes ROCm/rocm-libraries#5436, which is
blocked pending a review approval from a reviewer currently on PTO. The
same changes have been applied to this branch
(`users/tlakshma/ck/develop-clone`) to allow merging.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-04-02 20:55:03 -06:00
Christopher Millette
8af6e10be4 [CK_TILE] Optimize static_ford and sequence compile-time infrastructure (#5938)
## Problem

Each `static_for<0, N, 1>` instantiates its lambda N times (one per
`number<I>` type). When nested, intermediate lambdas capture the outer
loop variable (a different type per iteration), creating unique closure
types. For a 3-level nest with M=4, N=4, K=2, this produces 4 + 16 + 32
= 52 IR functions, of which 20 are intermediate closures that get
inlined away but still cost frontend compile time.

ck_tile's `static_ford` was supposed to eliminate these intermediates
(as old CK's PR #5031 did successfully), but it used a **recursive**
`static_ford_impl` that recreated the same closure pattern plus added
`reorder_old_to_new`/`reorder_new_to_old` overhead.

Additionally, the sequence utility layer (`sequence_sort`,
`is_valid_sequence_map`) used recursive template metaprogramming that
generated O(N log N) intermediate types for every permutation validation
— called on every `reorder_new_to_old`/`reorder_old_to_new` invocation.

## Changes

### 1. Replace `sequence_sort` with constexpr insertion sort
Replace recursive merge sort (`sequence_sort_impl` +
`sorted_sequence_merge_impl`, O(N log N) intermediate type
instantiations) with constexpr insertion sort using `static_array`. O(1)
template depth, same `::type` and `::sorted2unsorted_map` API.
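
The idea can be sketched with `std::array` in place of `static_array`. This is an assumption-laden illustration, not the ck_tile implementation: `SortResult` and `insertion_sort` are hypothetical names, and the `sorted2unsorted` member mirrors the `::sorted2unsorted_map` described above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Sorted values plus, for each sorted slot, the index it came from.
template <std::size_t N>
struct SortResult
{
    std::array<int, N> sorted;
    std::array<std::size_t, N> sorted2unsorted;
};

// Plain loops evaluated at compile time: no recursive template
// instantiation per merge step.
template <std::size_t N>
constexpr SortResult<N> insertion_sort(const std::array<int, N>& in)
{
    SortResult<N> r{};
    for(std::size_t i = 0; i < N; ++i)
    {
        r.sorted[i]          = in[i];
        r.sorted2unsorted[i] = i;
    }
    for(std::size_t i = 1; i < N; ++i)
    {
        const int key             = r.sorted[i];
        const std::size_t key_src = r.sorted2unsorted[i];
        std::size_t j             = i;
        // Shift larger elements right, carrying their source index along.
        while(j > 0 && r.sorted[j - 1] > key)
        {
            r.sorted[j]          = r.sorted[j - 1];
            r.sorted2unsorted[j] = r.sorted2unsorted[j - 1];
            --j;
        }
        r.sorted[j]          = key;
        r.sorted2unsorted[j] = key_src;
    }
    return r;
}

constexpr auto kSorted = insertion_sort(std::array<int, 4>{{3, 1, 2, 0}});
static_assert(kSorted.sorted[0] == 0 && kSorted.sorted[3] == 3, "values sorted");
static_assert(kSorted.sorted2unsorted[0] == 3, "smallest value came from index 3");
```

Insertion sort is quadratic in the number of elements, but the sequences involved are short and the work happens inside one constexpr evaluation.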

### 2. Replace `is_valid_sequence_map` with constexpr check
Replace sort-based permutation validation (which instantiated the full
`sequence_sort` chain) with a constexpr "seen array" loop. O(N)
constexpr steps instead of O(N log N) template instantiations.
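
A minimal sketch of a "seen array" permutation check, using `std::array`; `is_valid_permutation` is a hypothetical name and not the ck_tile trait itself:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// True when `map` contains each of 0..N-1 exactly once.
template <std::size_t N>
constexpr bool is_valid_permutation(const std::array<std::size_t, N>& map)
{
    std::array<bool, N> seen{}; // all false initially
    for(std::size_t i = 0; i < N; ++i)
    {
        const std::size_t v = map[i];
        if(v >= N || seen[v])
            return false; // out of range or duplicate
        seen[v] = true;
    }
    return true;
}

static_assert(is_valid_permutation(std::array<std::size_t, 3>{{2, 0, 1}}), "valid");
static_assert(!is_valid_permutation(std::array<std::size_t, 3>{{0, 0, 1}}), "duplicate");
static_assert(!is_valid_permutation(std::array<std::size_t, 3>{{0, 1, 3}}), "out of range");
```

One pass over the map is enough because a duplicate or an out-of-range entry is the only way N entries can fail to cover 0..N-1.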

### 3. Replace recursive `static_ford` with flat-loop `index_decomposer`
Replace `static_ford_impl` (recursive `static_for` nesting +
`pop_front`/`push_back` + `reorder_old_to_new` per iteration) with flat
`index_decomposer` using pre-computed strides. Add `decompose_reordered`
alias that folds reordering into decomposition, and `inverse_perm`
helper that avoids the `sequence_map_inverse` → `is_valid_sequence_map`
→ `sequence_sort` chain.

### 4. Eliminate internal lambda via `ford_applier`
The flat-loop approach still used `static_for` with a lambda, creating
M×N internal lambda instantiations per call site. Replace with
`ford_applier` struct that calls `f(decompose<I>{})` directly via fold
expression — zero intermediate closures:

```cpp
// Before: 2×M×N function instantiations
static_for<0, M*N, 1>{}([&](auto i) { f(decompose<i>{}); });

// After: M×N function instantiations (50% reduction)
ford_applier<Decomposer, make_index_sequence<M*N>>{}(f);
```

Also unified identity and non-identity order paths into a single
template with `constexpr if`.
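
A self-contained sketch of the fold-expression approach for a 2D index space, written against the standard library only. `decompose2`, `flat_applier` and `visit_2x3` are hypothetical names; the real `ford_applier` and decomposer in ck_tile may differ in interface.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical decomposer: maps flat index I to (m, n) in an M x N
// space, row-major (n varies fastest).
template <std::size_t N, std::size_t I>
struct decompose2
{
    static constexpr std::size_t m = I / N;
    static constexpr std::size_t n = I % N;
};

template <std::size_t N, typename Seq>
struct flat_applier;

template <std::size_t N, std::size_t... Is>
struct flat_applier<N, std::index_sequence<Is...>>
{
    template <typename F>
    constexpr void operator()(F&& f) const
    {
        // One call to f per flat index, in order; no wrapper lambda
        // is instantiated per iteration.
        (f(decompose2<N, Is>{}), ...);
    }
};

// Visit a 2 x 3 index space and record the visiting order.
inline std::array<int, 6> visit_2x3()
{
    std::array<int, 6> order{};
    int step = 0;
    flat_applier<3, std::make_index_sequence<2 * 3>>{}(
        [&](auto idx) { order[idx.m * 3 + idx.n] = step++; });
    return order;
}
```

The comma fold evaluates left to right, so the body sees indices in the same order a nested loop would produce.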

### 5. Fix const-qualified sequence handling
Fix `is_valid_sequence_map` to handle const-qualified sequence types via
`remove_cvref_t` in callers (`tensor_adaptor.hpp`,
`tile_distribution_encoding.hpp`).

## Results (this PR only, without flattening)

### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942, load ~5)

| Target | Base (s) | Treat (s) | Delta | % | Wins | Significant? |
|--------|----------|-----------|-------|---|------|-------------|
| **flatmm** | 160.1 | 152.7 | **-7.4s** | **-4.6%** | 6/7 | **YES** (W+=1, p<0.05) |
| universal_gemm | 228.4 | 224.7 | -3.7s | -1.6% | 6/7 | Trending (W+=4) |

Per-trial diffs (flatmm): [-6, -20, -9, -8, -8, 4, -5]
Per-trial diffs (universal_gemm): [-2, -6, 4, -3, -2, -11, -6]

### IR Function Counts (device trace, gfx942)

| Target | Metric | Before | After | Delta | % |
|--------|--------|--------|-------|-------|---|
| **universal_gemm** | InstantiateFunction | 117,715 | 109,165 | **-8,550** | **-7.3%** |
| **universal_gemm** | CodeGen Function | 47,912 | 45,044 | **-2,868** | **-6.0%** |
| **flatmm** | InstantiateFunction | 100,939 | 95,127 | **-5,812** | **-5.8%** |
| **flatmm** | CodeGen Function | 42,651 | 40,367 | **-2,284** | **-5.4%** |

Note: The `ford_applier` (commit 3) has minimal additional effect in
this PR since ck_tile code does not yet use `static_ford` extensively.
Its impact compounds when the follow-up flattening PR #5939 converts 124
`static_for` nests to `static_ford`. Combined results with #5939: flatmm
**-7.5%** wall time (p<0.01), CodeGen **-10.5%**.

### ASM Equivalence
7/7 PASS — 979,943 lines of device assembly verified identical (gfx942 +
gfx1100). TUs: universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale.

## Test plan
- [x] `test_ck_tile_static_ford`: 13 behavioral tests
(identity/non-identity orders, 1D-4D, unit dimensions, edge cases)
- [x] `ck_tile_unit_sequence`: 88 tests (11 new for sorted2unsorted_map,
is_valid_sequence_map edge cases, sequence_unique_sort map round-trip)
- [x] ASM equivalence verified (980K lines)
- [x] Wilcoxon timing verified (7 trials, flatmm p<0.05)
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-02 15:25:14 -06:00
Linjun-AMD
d06e2bfa2f [CK Tile] Add sink token gradient support in FMHA backward pass (#5504)
## Motivation

Adds sink token support to the FMHA backward kernel (dot_do_o pipeline).

## Technical Details

- Extend BlockFmhaBwdOGradDotOPipelineProblem with LSEDataType
- Add sink_ptr/d_sink_ptr/lse_ptr/nhead to FmhaBwdOGradDotOCommonKargs
- Compute per-head sink gradient via atomic accumulation in the pipeline
- Update example runner with reference validation for sink gradient

## Test Plan

Add new test case

## Test Result

WIP

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-02 11:17:01 +08:00
Chinmay Dattanand Kuchinad
d71f5d005a [CK] Fix async pivot mismatch in persistent GEMM kernel scheduler (#5776)
## Motivation

Fix pivot mismatch in the persistent GEMM kernel's async input scheduler
that causes **GPU hangs** and incorrect results when used with AsyncTP
(Asynchronous Tensor Parallelism) on ROCm.

PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM
kernel with chunk signals to overlap communication and computation. The
pivot mechanism ensures each rank starts computing from its own local
shard first (which is already available), then moves to remote chunks as
they arrive over the network.

Because of the pivot mismatch, the kernel frequently waits on signals
for chunks that have not yet arrived, while attempting to read data from
completely different chunks. This synchronization desync reliably
triggers infinite hangs during multi-GPU native AsyncTP execution. This
fix is required to enable functional AsyncTP support on ROCm.

## Technical Details

In the persistent kernel loop (`UniversalGemmKernel::operator()`), the
M-tile coordinate used for data selection (`i_m`) and the M-tile
coordinate used for the chunk-signal wait (`chunk_idx`) were derived
from inconsistent bases:

* `i_m` was computed from the **unpivoted** tile index `iM`.
* `chunk_idx` was computed from the **pivoted** expression `(iM +
tile_idx_pivot)`.

This means the kernel could wait for chunk N's signal but then read from
chunk M's memory, or vice versa. The mismatch scales with GPU count:
with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc.

**The Fix:**
Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m`
and `chunk_idx` from it. This guarantees the kernel always waits for the
correct chunk before reading its data.
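
The fix can be sketched as below. Only `iM`, `tile_idx_pivot`, `iM_eff`, `i_m` and `chunk_idx` come from the description; the struct, the function names, the other parameters, and the wrap-around of the pivot modulo the number of M tiles are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical result type for one persistent-loop iteration.
struct TileCoords
{
    int i_m;       // first row of the M tile to read
    int chunk_idx; // chunk whose signal must be waited on
};

// Assumption: the pivot wraps around the number of M tiles.
constexpr TileCoords pivoted_coords(int iM, int tile_idx_pivot, int num_m_tiles,
                                    int m_per_block, int tiles_per_chunk)
{
    // Single pivoted index; both outputs are derived from it.
    const int iM_eff = (iM + tile_idx_pivot) % num_m_tiles;
    return TileCoords{iM_eff * m_per_block, iM_eff / tiles_per_chunk};
}

// Chunk that owns a given row, used to check consistency.
constexpr int chunk_of_row(int row, int m_per_block, int tiles_per_chunk)
{
    return (row / m_per_block) / tiles_per_chunk;
}
```

With 8 M tiles, 128 rows per tile and 4 tiles per chunk, tile 0 with pivot 4 reads from row 512 and waits on chunk 1, which is the chunk that owns row 512.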

*(Note: Minor cosmetic `clang-format` changes were also pulled in
alongside the fix).*

## Test Plan

1. Build PyTorch with this CK change.
2. Run the specific multi-GPU AsyncTP native test:
`timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest
test/distributed/test_symmetric_memory.py -k
test_fused_all_gather_matmul_native -q -s -x`

## Test Result

Tests verify correct overlapping execution without hangs or accuracy
mismatches when running the AsyncTP native path with non-zero pivots.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-04-01 09:21:20 -07:00
aledudek
c672f410e6 [CK_TILE] Add pooling in tile_engine (#4469)
## Motivation

Add pooling in ck tile engine

## Technical Details


## Test Plan


## Test Result


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2026-04-01 07:31:46 +00:00
Yi DING
31e1965997 [CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR misallocation (#5991)
## Motivation

After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard
around the `num_total_loop <= 0` early-exit check, the IGLP pipeline
(`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV
gradients for non-masking kernels (even with the fix in #5915). Assembly
inspection confirms that the CFG change causes the LLVM register
allocator to reuse AGPR accumulators as scratch destinations in the
dK/dV reduction loop, breaking the loop-carried accumulation across
Q-tile iterations.

## Technical Details

- Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in
  `BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing:
  it restores the CFG shape that the register allocator needs to correctly
  assign dedicated AGPRs to each column of the dK/dV accumulator.
- Only the IGLP pipeline is affected; the other two BWD pipelines do not
  exhibit this issue.

## Test Plan

## Test Result

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-01 13:44:04 +08:00
Bartłomiej Kocot
3f31a8931f [CK][CK Tile] Force padding for atomic_add bf16 C tensor (#5842)
## Motivation

Force padding for atomic_add bf16 C tensor to avoid memfaults.

## Technical Details

- add global atomic adds for bf16 and enable them
- add padding for bf16 atomic add because it lacks out-of-bounds (OOB) checks
- remove padding for non-contiguous dims in conv for other cases
- minor bwd data conv fixes

## Test Plan

test_grouped_conv_*_tile

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-31 10:02:24 +02:00
jakpiase
be6d9bafa8 [CK_TILE] Changed cshuffle LDS descriptor to naive layout (#5729)
## Motivation
This PR changes the GEMM/convolution cshuffle layout to a plain (naive)
one to improve the performance of the cshuffle operation.

## Technical Details
Before this change, the cshuffle layout applied descriptor
transformations that were probably aimed at reducing LDS bank conflicts,
but the transformations themselves were very slow, which negatively
impacted performance.

## Test Plan
There is no need for additional tests, since current tests cover this
functionality.
2026-03-31 03:39:03 +00:00
Illia Silin
1dd32787a5 [CK] fix clang lifetimebound errors with staging compiler (#5921)
## Motivation

The ROCm staging compiler (newer Clang) enforces
`[[clang::lifetimebound]]` annotations on methods that return references
or pointers to internal object data. Without these annotations, the
staging compiler emits compilation errors for container accessor methods
across the CK and CK Tile namespaces.

## Technical Details

Adds `[[clang::lifetimebound]]` to all reference/pointer-returning
accessors in core container types:

  **`ck::` namespace:**
  - `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()`
  - `index_array` -- `operator[]`
  - `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()`
  - `IndexLookupTable` -- `operator[]`

  **`ck_tile::` namespace:**
  - `array` -- `get(i)`, `at()`, `operator[]`, `operator()`
  - `static_array` -- `operator[]`
  - `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()`
  - `make_kernel()` -- parameter pack

Also removes the unused `instance_index` variable from
`batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing
  accordingly.
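
A minimal sketch of what such an annotation looks like on a container accessor. `SmallArray` and the `SKETCH_LIFETIMEBOUND` macro are hypothetical names for illustration only and are not the CK types listed above; the macro guard is an assumption that keeps the snippet buildable with compilers other than Clang.

```cpp
#include <cassert>
#include <cstddef>

// Expand to the Clang attribute only when compiling with Clang.
#if defined(__clang__)
#define SKETCH_LIFETIMEBOUND [[clang::lifetimebound]]
#else
#define SKETCH_LIFETIMEBOUND
#endif

template <typename T, std::size_t N>
struct SmallArray
{
    T data[N];

    // The attribute follows the parameter list (and cv-qualifier), so it
    // applies to the implicit `this` parameter: the returned reference
    // must not outlive the array object it points into.
    T& operator[](std::size_t i) SKETCH_LIFETIMEBOUND
    {
        assert(i < N);
        return data[i];
    }

    const T& operator[](std::size_t i) const SKETCH_LIFETIMEBOUND
    {
        assert(i < N);
        return data[i];
    }
};
```

With the annotation in place, Clang can diagnose code that binds the returned reference to a temporary container that is destroyed at the end of the full expression.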

## Test Plan

- Compile with the staging compiler to verify all lifetimebound errors
are resolved
- Existing tests pass unchanged -- the attribute is a compile-time
annotation with no runtime effect

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-30 07:19:32 -07:00
Yi DING
42b3728f46 [CK_TILE] Fix FMHA BWD register pressure by wrapping num_total_loop with amd_wave_read_first_lane (#5915)
## Motivation

In three FMHA backward pipelines, `num_total_loop` is computed without
`amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even
though it is logically uniform across all lanes. This raises register
pressure, and under high pressure the compiler may reuse VGPRs across
overlapping live ranges. This was confirmed via assembly inspection: the
compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an
intermediate value for dV, producing incorrect dK/dV gradients.

## Technical Details

Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three
pipelines:
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr`
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp`
- `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr`

This promotes `num_total_loop` to an SGPR, eliminating the excess
register pressure and the incorrect VGPR reuse.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-30 09:44:35 +08:00
Jan Patrick Lehr
07aacceacd [CK] More lifetime-warning suppression (#5639)
## Motivation

The staging compiler picked up another change from upstream that leads
to more lifetime-analysis warnings. This breaks the build, given CK is
built with -Werror. As a result, compiler promotion is blocked.

## Technical Details
This patch adds diagnostic push/pop pragmas that ignore the lifetime
warnings in the modified files, which unblocks compiler promotion.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-28 11:19:46 +00:00
Linjun-AMD
d8c79ad9e8 [CK_TILE ]Revert "[CK_TILE] Enable MXFP6 for MX GEMM op (#5095)" (#5849)
This reverts commit bf707265a8.

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 20:36:39 +00:00
Johannes Graner
94bc6ab3ab [CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393)
## Motivation

Add StreamK work distribution to the CK Tile grouped convolution
backward weight kernel. Split-K divides the K-dimension uniformly across
a fixed `k_batch`, which causes load imbalance when the number of output
tiles doesn't evenly fill the GPU. StreamK distributes total
K-iterations evenly across workgroups, improving utilization on these
shapes.
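
The load-imbalance argument can be made concrete with a rough model. The sketch below is an illustration under simplifying assumptions, not the kernel's scheduler: every K-iteration costs the same, `num_slots` workgroups run concurrently, and reduction overhead is ignored. All function and parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

inline std::int64_t ceil_div(std::int64_t a, std::int64_t b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

// Split-K: each output tile is cut into k_batch equal slices, and the
// resulting work items are executed in waves of num_slots workgroups.
inline std::int64_t split_k_critical_path(std::int64_t num_tiles,
                                          std::int64_t iters_per_tile,
                                          std::int64_t k_batch,
                                          std::int64_t num_slots)
{
    const std::int64_t work_items = num_tiles * k_batch;
    const std::int64_t waves      = ceil_div(work_items, num_slots);
    return waves * ceil_div(iters_per_tile, k_batch);
}

// StreamK: the total number of K-iterations is spread evenly over the
// workgroups, regardless of tile boundaries.
inline std::int64_t stream_k_critical_path(std::int64_t num_tiles,
                                           std::int64_t iters_per_tile,
                                           std::int64_t num_slots)
{
    return ceil_div(num_tiles * iters_per_tile, num_slots);
}
```

For example, with 5 output tiles of 64 K-iterations each and 8 concurrent workgroups, this model gives a critical path of 64 iterations for split-K with `k_batch` of 1 or 2, and 40 iterations for StreamK. The measured results in the table below also include reduction and synchronization costs that this model leaves out.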

## Technical Details

StreamK is added as an `if constexpr` branch in the existing kernel,
selected by the `TilePartitioner_` template parameter. Two reduction
strategies are supported:
- **Linear**: tile-starter sequentially accumulates partials from
contributing CTAs
- **Tree**: pairwise binary tree reduction (O(log n) depth, faster for
many contributors)

Both persistent and non-persistent data-parallel (DP) sections are
supported.
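
The difference between the two reduction strategies can be sketched on the host as follows. This is a sequential illustration that only counts dependent reduction steps; the names are hypothetical and the real kernel reduces partial tiles across CTAs through a workspace with flag-based synchronization.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ReduceResult
{
    float sum; // reduced value
    int depth; // number of dependent (sequential) reduction steps
};

// Linear: the tile-starter adds every partial one after another, O(n) depth.
inline ReduceResult reduce_linear(const std::vector<float>& partials)
{
    assert(!partials.empty());
    float acc = partials[0];
    int depth = 0;
    for(std::size_t i = 1; i < partials.size(); ++i)
    {
        acc += partials[i];
        ++depth;
    }
    return ReduceResult{acc, depth};
}

// Tree: contributors are combined pairwise; all additions within one
// stride level are independent of each other, giving O(log n) depth.
inline ReduceResult reduce_tree(std::vector<float> partials)
{
    assert(!partials.empty());
    int depth = 0;
    for(std::size_t stride = 1; stride < partials.size(); stride *= 2)
    {
        for(std::size_t i = 0; i + stride < partials.size(); i += 2 * stride)
            partials[i] += partials[i + stride];
        ++depth;
    }
    return ReduceResult{partials[0], depth};
}
```

With eight contributors, the linear variant needs seven dependent steps while the tree variant needs three, which is the effect the O(log n) remark refers to.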

Key changes:
- `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution
path with `RunStreamK`/`RunStreamKLoop`, partial store/load via
workspace, flag-based cross-CTA synchronization,
`GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions
- `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers)
and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by
both GEMM and Conv StreamK kernels
- `streamk_gemm_kernel.hpp`: Refactored to use shared helpers
- Merged split-K and StreamK example invokers via `PartitionerPolicy`
template parameter
- StreamK example binary with `--streamk_reduction=linear|tree` and
`--streamk_persistent=0|1`
- CK Builder integration: `SpecifiesStreamK` concept,
`TilePartitionerType` factory helper, `InstanceTraits` with StreamK
fields
- 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP),
negative, builder regression

### Performance (MI355X, gfx950)

Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}):

| Shape | 16x64 tiles, Split-K | 16x64 tiles, StreamK | 128x128 tiles, Split-K | 128x128 tiles, StreamK |
|---|---|---|---|---|
| 1x1 128x128 N=32 28x28 | 1.00x | 0.54x | 1.00x | 0.81x |
| 3x3 128x128 N=32 14x14 | 1.00x | 0.59x | 1.00x | 0.62x |
| 1x1 256x64 N=32 56x56 | 1.00x | 0.83x | 1.00x | 1.83x |
| 3x3 512x512 N=2 7x7 | 1.00x | 1.12x | 1.00x | 0.62x |
| 1x1 1024x1024 N=4 7x7 | 1.00x | 1.09x | 1.00x | 0.60x |
| 3x3 128x128 N=32 28x28 | 1.00x | 0.44x | 1.00x | 0.96x |
| 3x3 256x256 N=32 14x14 | 1.00x | 0.67x | 1.00x | 0.93x |
| 3x3 512x512 N=32 7x7 | 1.00x | 0.98x | 1.00x | 1.16x |

StreamK's value depends on tile config: with larger tiles (fewer output
tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up
to 1.16x on typical large-channel convolutions. Tree reduction
consistently outperforms Linear when multiple CTAs contribute to the
same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n)
sequential accumulation. The table reports the best of Linear and Tree
for each shape.

## Test Plan

```bash
ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk
./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk

# Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON)
ninja -C build check-builder
```

30 tests covering:
- Host-side: type traits, kernel args construction, grid size, workspace
size
- GPU end-to-end (Linear + Tree): small/medium shapes, multi-group,
stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher
occupancy
- Persistent DP: Linear + Tree with persistent data-parallel dispatch
- Negative: `IsSupportedArgument` rejects unaligned K and C
- Builder: Create (instance string validation) + Execution (reference
comparison) + instance string regression

## Test Result

All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK
tests pass. Full `check-builder` suite passes. Tolerances computed
dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-27 10:17:10 +01:00
Yi DING
f15df65126 [CK_TILE] Fix NaN for FMHA BWD When seq_q=0 (#5790)
## Motivation
This PR addresses NaNs in the FMHA backward (dQ/dK/dV) path when the
effective query sequence length for a tile is zero, by ensuring the
per-tile pipelines exit early with zeroed accumulators and by avoiding
an early kernel return that prevented writing out cleared gradients.

## Technical Details
- Add unconditional early-exit in the dK/dV pipelines when
`num_total_loop <= 0` (no work), returning zeroed accumulators.
- Adjust group-mode kernel early-return logic to only return when
**both** `seqlen_q` and `seqlen_k` are zero, allowing blocks to run and
store cleared dK/dV when `seqlen_q == 0`.
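
A host-side sketch of the two conditions described above. Only `num_total_loop`, `seqlen_q` and `seqlen_k` are taken from the text; the function names, the accumulator type and the per-iteration update are placeholders assumed for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Group-mode kernel: return early only when there is nothing to compute
// AND nothing to clear, i.e. both sequence lengths are zero.
inline bool group_mode_returns_early(int seqlen_q, int seqlen_k)
{
    assert(seqlen_q >= 0 && seqlen_k >= 0);
    return seqlen_q == 0 && seqlen_k == 0;
}

// dK/dV pipeline: with no loop iterations, hand back zeroed accumulators
// so the caller still stores well-defined (cleared) gradients.
template <std::size_t N>
std::array<float, N> accumulate_dk_dv(int num_total_loop, float per_iter_value)
{
    std::array<float, N> acc{}; // value-initialized: all zeros
    if(num_total_loop <= 0)
        return acc;
    for(int it = 0; it < num_total_loop; ++it)
        for(std::size_t i = 0; i < N; ++i)
            acc[i] += per_iter_value; // stand-in for the real tile update
    return acc;
}
```

With `seqlen_q == 0` and `seqlen_k > 0` the block keeps running, the pipeline sees no work, and zeros are written out instead of leaving dK/dV uninitialized.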

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-03-27 15:54:01 +08:00
joyeamd
6733fd72a5 [CK][CK_TILE] Revert addional oob check in gemm IsSupported function (#5789)
## Motivation

Fix ck_tile's OOB check.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-26 09:40:44 +08:00
Ville Pietilä
d1ba4e9b11 [CK_TILE, CK_BUILDER] Add bwd data to CK Tile profiler (#5516)
## Motivation

We want to close the performance gap between old CK and CK Tile for bwd
data convolutions. To achieve this, we need two things:

- Configurations for the old CK kernel instances such that we can map
them into CK Tile instances.
- Support in CK profiler to run the CK Tile instance with the same API
as for old CK instances.

## Technical Details

Extracted kernel configurations from old CK. The codegen Python script
for CK Tile convs is extended to also support bwd data. The generated
instances are added to the CMake build (target
`device_grouped_conv_bwd_data_tile_instances`).
A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the
CK Profiler. Its API is the same as that of old CK's profiler op
`grouped_conv_bwd_data`.

---------

Co-authored-by: Ville Pietilä <>
2026-03-25 14:34:13 +00:00
joyeamd
22df18b10e Revert "Ck/joye/revert oob check (#5640)" (#5697)
This reverts commit 1cc5380ee9.

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-23 22:04:14 +00:00
Bartłomiej Kocot
0b7ae0ac65 [CK][CK Tile] Fix kbatch check in grouped conv and gemm kernels (#5555)
## Motivation

Fix kbatch check in grouped conv and gemm kernels, allow tails for
kbatch.

## Technical Details

Round up K / Kperxdl and divide it by Kbatch to allow tail for K.

## Test Plan

test_grouped_convnd_bwd_weight_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-21 23:55:24 +01:00
Bartłomiej Kocot
2e22c67ce6 [CK][CK Tile] Improve access for merged groups and remove modulo from xor (#5334)
## Motivation

[CK][CK Tile] Improve access for merged groups and remove modulo from
xor

## Technical Details

- add a template parameter to xor that selects whether modulo is applied;
modulo is not needed for merged groups
- use access by m for merged groups for a tensor

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 15:45:45 +00:00
Kiefer van Teutem
e785241250 [CK_Tile] Refactor amdgcn_mma policy structs (#5272)
## Motivation
The point of this MR is to update the intrinsic layout parameters to
simplify them and make them more clear and flexible. Also, a number of
simple refactors were performed to reduce boilerplate and code
duplication.

## Technical Details
In CK Tile and old CK, the full set of information available in the
intrinsic wrappers, for WMMA and MFMA combined, would be something like:

```
// Basic info
using ADataType = void;
using BDataType = void;
using CDataType = void;

using AVecType = ext_vector_t<ADataType, 0>;
using BVecType = ext_vector_t<BDataType, 0>;
using CVecType = ext_vector_t<CDataType, 0>;

// Fragment sizes
static constexpr index_t kM;
static constexpr index_t kN;
static constexpr index_t kK;

// Layout parameters
static constexpr index_t kAMBlock;
static constexpr index_t kBNBlock;

static constexpr index_t kRepeat;
static constexpr index_t kAMLane;
static constexpr index_t kBNLane;
static constexpr index_t kABK0PerLane;
static constexpr index_t kABKLane;
static constexpr index_t kABK1PerLane;

static constexpr index_t kCMLane;
static constexpr index_t kCNLane;
static constexpr index_t kCM0PerLane;
static constexpr index_t kCM1PerLane;

using kABPs2RHssMajor = sequence<2, 1>;
using kABPs2RHssMinor = sequence<1, 0>;
using kABYs2RHsMajor  = sequence<2, 2>;
using kABYs2RHsMinor  = sequence<0, 2>;

using kCPs2RHssMajor = sequence<1, 2>;
using kCPs2RHssMinor = sequence<1, 0>;
using kCYs2RHsMajor  = sequence<1, 1>;
using kCYs2RHsMinor  = sequence<0, 2>;

using kCTPs2RHssMajor = sequence<2, 1>;
using kCTPs2RHssMinor = sequence<1, 0>;
using kCTYs2RHsMajor  = sequence<2, 2>;
using kCTYs2RHsMinor  = sequence<0, 2>;
```
Note that on top of the intrinsic sizes, we have 12 layout parameters. I have reduced this in the new design to:

```
// Basic info
using ADataType = void;
using BDataType = void;
using CDataType = void;

// Fragment sizes
static constexpr index_t kM;
static constexpr index_t kN;
static constexpr index_t kK;

// Layout parameters
static constexpr index_t kABKPerLane;  // K2 * K0, always the same, even for diff A / B layouts
static constexpr index_t kAKNumAccess; // K2
static constexpr index_t kARepeat;     // Used for RDNA3 repeated inputs and CDNA block hiding.
static constexpr index_t kBKNumAccess; // K2
static constexpr index_t kBRepeat;     // Used for RDNA3 repeated inputs and CDNA block hiding.
static constexpr index_t kCMPerLane;   // M2 * M0
static constexpr index_t kCMNumAccess; // M2

// Derived properties
using AVecType = ext_vector_t<ADataType, 0>;
using BVecType = ext_vector_t<BDataType, 0>;
using CVecType = ext_vector_t<CDataType, 0>;
```

Note that there are now only 7 layout parameters and no more dimensionality orderings. Believe it or not these 7 parameters are more general than the original 12, and can handle intrinsic and mid-level features that are currently awkward in CK Tile, like dealing with AttrNumAccess, different A / B layouts, more general block-hiding (currently very limited in CK tile), and future arch features.

Furthermore, the A, B and C vec types are now derived directly from the layout parameters to ensure internal consistency.

I added a detailed explanation of the new params in terms of register mappings at the top of `amdgcn_mma.hpp`.

Other refactorings I did in this MR:

- Make an amdgcn_mma_base struct to drastically reduce code duplication and potential bugs. Should also make auto-generating the amd_gcn specializations much easier.
- Simplify the MmaOpTraits significantly by only including those parameters that are not directly gettable from the MmaOp itself. This removes duplicated variables and simplifies higher level code.
- Remove overloaded "Block" term for intrinsic dimensions, and replace by "Frag" instead. Some spots were already using the term "Frag" for combined intrinsics, in which case I changed that term to "Chunk" instead.
- Remove some tests that had become somewhat pointless (setting variables and then checking their values immediately).

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 09:07:00 -06:00
joyeamd
1cc5380ee9 Ck/joye/revert oob check (#5640)
## Motivation

Fix ck_tile's OOB check.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 12:30:08 +00:00
arai713
005f9fc582 [CK_TILE] Rename Stream-K grid function (#4795)
## Motivation
This PR introduces a change in the name of the get_grid function in the
Stream-K TilePartitioner to avoid confusion with a similarly named
method. In the Stream-K TilePartitioner, there is get_grid() which
returns num_cu*occupancy and there is grid_size() which returns the grid
size used to launch the kernel. In this PR, we change get_grid() to be
get_max_active_wgs() to better reflect what the function returns and not
confuse it with grid_size().

## Technical Details
Initially in the Stream-K TilePartitioner we had get_grid() which
returned grid_. We are renaming get_grid() to get_max_active_wgs() and
grid_ to max_active_wgs_ internally, while keeping grid_size() the same.
The parameter, grid, for the Stream-K TilePartitioner remains the same
to maintain consistency with the rest of the Stream-K API.

## Test Plan
Validated using the test suite that is already present.

## Test Result
All tests passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 03:27:44 -06:00
Sami Remes
bf707265a8 [CK_TILE] Enable MXFP6 for MX GEMM op (#5095)
## Motivation

Add support for MXFP6 in the MX GEMM op in CK-Tile.

Depends on https://github.com/ROCm/rocm-libraries/pull/4594

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-19 18:07:47 -07:00
Bartłomiej Kocot
b90e64e600 [CK][CK Tile] Fix dram step for KM/KN layouts in V1 pipeline (#5470)
## Motivation

Fix v1 pipeline for KM/KN layouts by passing correct step for dram tile
window.

## Technical Details

- Fix dram step for KM/KN layouts in V1 pipeline
- Disable instances which use more threads than the warp size in the
contiguous dim (not supported in CK Tile yet)
- Use 1x1 specialization for explicit gemm
- Use two-stage for vector size == 1 and sizeof(datatype) == 2
- Remove an unneeded check, since GetVectorSizeA/B check whether the
vector size is fixed

## Test Plan

test_grouped_convnd_bwd_weight_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-966
2026-03-19 11:59:44 +00:00
assistant-librarian[bot]
7a8410498d [CK_TILE] add tf32 support (#4302)
## Proposed changes

TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in
CK_TILE on gfx942 and gfx950.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion



---
🔁 Imported from
[ROCm/composable_kernel#3538](https://github.com/ROCm/composable_kernel/pull/3538)
🧑‍💻 Originally authored by @yingluAMD

---------

Co-authored-by: yingluAMD <Yingmao.Lu@amd.com>
Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-03-19 10:17:20 +01:00
Thomas Ning
bb04a24b96 CK Tile MX GEMM Packing Improvement (#5323)
## Motivation

Reduce the scale loading size and make better use of MFMA scale
selection.

## Technical Details

Add up the packing of mx scales.

## Test Plan

Use the existing test cases.

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
2026-03-17 11:57:32 -07:00
Hosang
4b0ec8d90c [CK_TILE] Add LLC-aware FMHA head grouping and head-major scheduling on RDNA (#5018)
## Motivation
Long-sequence FMHA can become memory-bound when K/V working sets exceed
Infinity Cache (LLC), causing repeated DRAM traffic across heads.

This PR introduces LLC-aware launch ordering improvements for FMHA
forward, and it is currently enabled only on gfx11 and gfx12. The
approach is inspired by
[`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217),
adapted to CK’s kernel/runner structure and layout handling.

In this context, `bshd` is the layout used in Flash-Attention, while
`bhsd` is the default layout used by the CK Tile FMHA example.

## Technical Details
This PR adds two complementary strategies:

- For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware
  head grouping:
  - Estimate LLC size (env override, KFD sysfs, or arch default).
  - Compute group size from K/V bytes per head vs LLC target.
  - Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and
    related tensors).

- For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit
  launch-order adjustment:
  - Keep a single kernel launch.
  - Reinterpret block linearization in `GetTileIndex` to make execution
    head-major, improving temporal locality of per-head K/V reuse.

Additional integration updates:
- Propagate `num_head_q_total` and `head_start` through FMHA args/kargs.
- Use global head indexing for dropout RNG stream mapping so grouped
  launches keep deterministic/consistent dropout behavior.
- Keep fallback behavior unchanged when grouping is not beneficial or
  disabled.
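
The head-grouping step for the `bshd` path can be sketched as below. The sizing formula (K plus V bytes per head compared against an LLC byte budget) is an assumption based on the description above, and all names are hypothetical; the actual heuristic in the PR may weigh things differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Number of heads whose K and V working sets fit into the LLC budget,
// clamped to the range [1, num_heads].
inline std::int64_t heads_per_group(std::int64_t seqlen_k,
                                    std::int64_t hdim,
                                    std::int64_t bytes_per_elem,
                                    std::int64_t num_heads,
                                    std::int64_t llc_target_bytes)
{
    assert(seqlen_k > 0 && hdim > 0 && bytes_per_elem > 0 && num_heads > 0);
    // Assumed footprint: one K and one V tensor slice per head.
    const std::int64_t kv_bytes_per_head = 2 * seqlen_k * hdim * bytes_per_elem;
    const std::int64_t fit               = llc_target_bytes / kv_bytes_per_head;
    return std::max<std::int64_t>(1, std::min(fit, num_heads));
}

// Number of kernel launches needed to cover all heads with that group size.
inline std::int64_t num_group_launches(std::int64_t num_heads, std::int64_t group)
{
    assert(group > 0);
    return (num_heads + group - 1) / group;
}
```

Under these assumptions, 24 bf16 heads with `seqlen_k = 16384` and `hdim = 128` need 8 MiB of K/V per head, so a 64 MiB budget yields groups of 8 heads and 3 launches; at `seqlen_k = 1024` everything fits and a single launch covers all heads, which matches the fallback described in the last bullet.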

## Test Plan
- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`

## Test Result
- `test_ck_tile_fmha`: all tests passed.
- `tile_example_fmha_fwd`: tested on gfx1100, gfx1151, and gfx1201; all
three show higher performance than the baseline. The improvement is
consistent, and performance is well maintained even at long sequence
lengths.

Command: `./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}`

- TFLOPs by sequence length (target: gfx1100, layout: bhsd)

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 56.27 | 61.48 | 1.09x
4096 | 67.10 | 72.27 | 1.08x
8192 | 65.99 | 71.64 | 1.09x
12288 | 61.60 | 76.61 | 1.24x
16384 | 58.99 | 75.74 | 1.28x
20480 | 57.32 | 74.42 | 1.30x
24576 | 56.89 | 74.25 | 1.31x
27280 | 18.93 | 24.48 | 1.29x

- TFLOPs by sequence length (target: gfx1201, layout: bshd)

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 66.79 | 65.90 | 0.99x
4096 | 85.90 | 86.80 | 1.01x
8192 | 77.06 | 90.29 | 1.17x
12288 | 58.36 | 88.98 | 1.52x
16384 | 52.12 | 88.88 | 1.71x
20480 | 48.11 | 88.42 | 1.84x
24576 | 47.12 | 89.07 | 1.89x
27280 | 49.05 | 50.31 | 1.03x

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 21:18:34 +00:00
Enrico Degregori
4c5465c4f9 [CK Tile] Eight Waves pipeline GEMM (#4964)
## Motivation

The eight waves pipeline was added for ABQuant. The goal of this PR is
to enable it for GEMM as well.

## Technical Details

Summary:
- Block:
  - Create a block struct for GEMM using eight-warp-specific distribution
    encodings
  - Use this block struct in ABQuant for encodings
- Pipeline:
  - Create an impl pipeline for eight waves which can be used by GEMM and
    ABQuant as a base (and by AQuant and BQuant in the future)
  - Create an eight waves pipeline for GEMM (this cannot be easily
    integrated into the existing async pipeline)
- Pipeline policy:
  - Extract the GEMM-specific parts of the ABQuant policy to define a GEMM
    policy (ABQuant then uses it as a base and adds Quant-specific methods)
- Minor: naming was inconsistent between warp/wave; everything is now
  referred to as eight waves

So overall we have:
- block struct directly used by GEMM -> ABQuant derived struct to
implement operator
- Impl base pipeline with general implementation -> GEMM and ABQuant
pipelines use it to avoid code duplication but still define their own
pipelines
- pipeline policy struct directly used by GEMM -> ABQuant derived policy
struct for Quant specific parts

## Test Plan

Added new tests for GEMM pipeline:
`test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950
supports it).

Note: K padding test is disabled for this pipeline because it's not
implemented yet

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 09:30:54 +01:00
Yi DING
da5841d5cc [CK_TILE] FMHA BWD Use Persistent Kernels in Deterministic Mode (#5174)
## Motivation
This PR enables a persistent-kernel execution path for FMHA backward
(dQ/dK/dV) in deterministic mode, adjusting how dQ accumulation is
split, stored, and converted back to final gradients.

## Technical Details
- Introduces a persistent-kernel grid mapping in deterministic mode and
updates split-count calculation accordingly.
- Extends kernel kargs to carry batch-related info needed for persistent
scheduling and dQ conversion.
- Refactors dQ store conditions and adds mask-type traits/utilities and
runner logging updates.

## Test Plan
- Jenkins
[base](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/10/pipeline)
- Jenkins
[AITER](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/12/pipeline)
- Jenkins
[FMHA](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/11/pipeline)
- local FA tests

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-13 14:13:32 +08:00
Ville Pietilä
5d6e69194d [CK_TILE, CK_BUILDER] Add two-stage bwd weight kernels to CK Tile profiler (#5237)
## Motivation

PR #4797 added CK Tile bwd weight kernels to the CK Profiler. The
two-stage kernels were not supported in the initial PR. This PR adds the
missing bwd weight two-stage kernels to the CK Profiler.

## Technical Details

Extended the CK Tile conv builder factory to also build the elementwise
ops required for the two-stage kernels. Extended the CK Builder for CK
Tile instances to accept the two-stage flag as part of the algorithm
configuration.

## Test Plan

Added unit tests for CK Builder that verify the two-stage kernel
construction.
  
## Test Result

If CI passes, the added unit tests are passing.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Ville Pietilä <>
2026-03-12 19:20:15 -06:00
Adam Osewski
aea2a90872 [CK TILE] Skip work if any of Grouped GEMM groups M/N/K are zero. (#5050)
## Motivation

It's common in MoE workloads that some experts receive zero tokens,
which results in some of the dimensions being equal to zero. Currently we
handle this case only for non-persistent kernels, where all GEMM
information is available on the host beforehand and is validated during
creation of the kernel arguments. However, for the "dynamic" input path
(persistent kernel) this information is not available before kernel
launch, so it has to be validated during kernel execution. The goal of
this PR is to add this validation.

## Technical Details

Skip work if any of Grouped GEMM groups M/N/K are zero for persistent
kernel path.
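
A minimal host-side sketch of the skip logic, assuming a hypothetical
`GroupDims` descriptor and `count_tiles` helper (neither is CK Tile API).
In the persistent kernel the equivalent check would run on the device
inside the loop over groups; this version only illustrates that a group
with any zero dimension contributes no work.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-group problem size; not the CK Tile argument struct.
struct GroupDims
{
    int M;
    int N;
    int K;
};

// A group has work only if all three GEMM dimensions are non-zero.
constexpr bool has_work(const GroupDims& g) { return g.M > 0 && g.N > 0 && g.K > 0; }

constexpr std::size_t ceil_div(int a, int b)
{
    return static_cast<std::size_t>((a + b - 1) / b);
}

// Count the output tiles that actually need processing, skipping empty groups.
inline std::size_t count_tiles(const std::vector<GroupDims>& groups, int tile_m, int tile_n)
{
    std::size_t tiles = 0;
    for(const GroupDims& g : groups)
    {
        if(!has_work(g))
            continue; // e.g. an expert that received zero tokens
        tiles += ceil_div(g.M, tile_m) * ceil_div(g.N, tile_n);
    }
    return tiles;
}
```

With groups `{128, 256, 64}`, `{0, 256, 64}` and `{128, 256, 0}` and a
64x64 tile, only the first group contributes tiles.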

## Test Plan

Add unit-tests which cover "dynamic" inputs with zero dims for
persistent kernel execution path.

## Test Result

All tests pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 13:28:24 +00:00
chris-tsiaousis-hpc
6c49b6a670 Changed the include order of the new WMMA/MFMA unification framework (#5241)
These changes fix the include order and make header files independent
of one another. Also, the `remod.py` script was run and changed the
`grouped_convolution.hpp` and `core.hpp` files.

## Motivation

Some headers appear to depend on include order.
For example, moving `#include "wmma/wmma.hpp"` later in the include list of
[amdgcn_mma.hpp](https://github.com/ROCm/rocm-libraries/blob/develop/projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp)
causes compilation errors. The pre-commit script `remod.py` also
reorders includes alphabetically, which triggers the same compilation
issues.

Expected behaviour:
Headers should be independent of one another: no header should require
another to be included first. Each header should compile correctly on
its own.

## Test Plan

The CI (that runs `remod.py`) should compile.

## Test Result

Existing CI should compile and be green.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-03-12 09:26:58 +01:00
Bartłomiej Kocot
1972d39410 [CK][CK Tile] Improvements for grouped conv fwd tile profiling (#5114)
## Motivation

Improve profiling for grouped convolution forward for better comparison
between CK and CK Tile.

## Technical Details

- Include preprocessing time for CK Tile
- Add flush cache for conv fwd profiler
- Switch configs to builder reflect
- Add KPerXdl deduce
- Add non-grouped ported instances

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-786
2026-03-11 23:38:15 +01:00
Christopher Millette
0c7665a331 [CK_TILE] Optimize ck_tile::sequence to reduce template instantiation depth [2A] (#5028)
## Summary

### Rationale
`ck_tile::sequence` is the most fundamental metaprogramming type in
ck_tile: it underpins tensor dimensions, strides, loop bounds, and index
calculations. Six of its metafunctions use recursive template
instantiation, producing O(N) to O(N²) intermediate types that the
compiler must process. When these are used inside deeply nested GEMM
pipelines with large dimension counts, the cumulative instantiation
overhead becomes a significant contributor to frontend compile time.

Measurements on `test_gemm_pipeline_compv6` show 84,288
`InstantiateFunction` calls in the frontend alone. Reducing template
instantiation depth in these core utilities has a multiplicative effect
because they are called from hundreds of sites.

### What changed

| Metafunction | Before | After |
|---|---|---|
| `sequence::modify` | O(N) recursive split/merge | O(1) pack expansion |
| `sequence_gen` | O(log N) recursive binary split | O(1) via `__make_integer_seq` |
| `uniform_sequence_gen` | Delegates to `sequence_gen` | O(1) via `__make_integer_seq` |
| `sequence_reverse_inclusive_scan` | O(N) recursive | O(1) constexpr for-loop + pack expansion |
| `sequence_inclusive_scan` | Computed via reverse + flip | O(1) constexpr for-loop (unified impl) |
| `sequence_exclusive_scan` | O(N) recursive merge chain | O(1) constexpr for-loop + pack expansion |
| `sequence_map_inverse` | O(N²) recursive modify calls | O(1) constexpr for-loop + pack expansion |
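
The "constexpr for-loop + pack expansion" pattern from the table can be
sketched as follows. This is an illustration under assumptions, not the
code in `sequence.hpp`: `seq`, `small_array`, `inclusive_scan_values` and
`inclusive_scan_t` are hypothetical stand-ins for the ck_tile types. All
prefix sums are computed in one constexpr function with a plain loop, and
a single pack expansion turns the result back into a type, so no
recursive chain of intermediate templates is instantiated.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Illustrative stand-ins; not the ck_tile types.
template <int... Is>
struct seq
{
};

template <typename T, std::size_t N>
struct small_array
{
    T elems[N > 0 ? N : 1]; // guard against zero-length arrays
};

// One constexpr evaluation computes every prefix sum in a plain loop.
template <int... Is>
constexpr small_array<int, sizeof...(Is)> inclusive_scan_values()
{
    small_array<int, sizeof...(Is)> out{{Is...}};
    for(std::size_t i = 1; i < sizeof...(Is); ++i)
        out.elems[i] += out.elems[i - 1];
    return out;
}

template <typename Seq, typename Idx>
struct inclusive_scan_impl;

// A single pack expansion over Js converts the computed array into a type.
template <int... Is, std::size_t... Js>
struct inclusive_scan_impl<seq<Is...>, std::index_sequence<Js...>>
{
    static constexpr small_array<int, sizeof...(Is)> values = inclusive_scan_values<Is...>();
    using type = seq<values.elems[Js]...>;
};

template <typename Seq>
struct inclusive_scan;

template <int... Is>
struct inclusive_scan<seq<Is...>>
    : inclusive_scan_impl<seq<Is...>, std::make_index_sequence<sizeof...(Is)>>
{
};

template <typename Seq>
using inclusive_scan_t = typename inclusive_scan<Seq>::type;

static_assert(std::is_same<inclusive_scan_t<seq<1, 2, 3, 4>>, seq<1, 3, 6, 10>>::value,
              "inclusive scan yields the running sums");
```

The loop body is the only place the reduction appears, so swapping `+=`
for another reducer or iterating in reverse would not change the number
of template instantiations.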

Supporting changes:
- Portable `__type_pack_element` fallback with `__has_builtin` guard
(hipRTC-safe, no `<tuple>` dependency)
- Renamed reserved `__integer_sequence` to `integer_sequence_wrapper`
- Adopted `static_array` from develop (PR #4355) for constexpr
computation
- Unified forward and reverse inclusive scan into a single
`sequence_inclusive_scan_impl` with `bool Reverse` template parameter
- Added `sequence_inclusive_scan` struct (new public API for forward
scan direction)
- Replaced recursive `sequence_exclusive_scan` (3 template
specializations) with `sequence_exclusive_scan_impl` using the same
constexpr for-loop pattern as inclusive scan
- Rewired `exclusive_scan_sequence` and `prefix_sum_sequence` to use new
impl
- Added `CK_TILE_HOST_DEVICE` to `exclusive_scan_sequence` and
`prefix_sum_sequence` to match sibling scan function annotations
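
A sketch of what a `__has_builtin`-guarded `__type_pack_element` fallback
can look like. The recursive fallback shown here is an assumption made
for illustration (the actual fallback in `sequence.hpp` may differ), and
`detail::at_index` is a hypothetical helper name; only the `at_index_t`
interface is named in this PR.

```cpp
#include <cstddef>
#include <type_traits>

#ifndef __has_builtin
#define __has_builtin(x) 0 // compilers that lack the feature-test macro
#endif

namespace detail {
#if __has_builtin(__type_pack_element)
// Fast path: the compiler selects the I-th type directly.
template <std::size_t I, typename... Ts>
struct at_index
{
    using type = __type_pack_element<I, Ts...>;
};
#else
// Portable fallback (assumed): peel one type per step, no <tuple> needed.
template <std::size_t I, typename... Ts>
struct at_index;

template <std::size_t I, typename T, typename... Ts>
struct at_index<I, T, Ts...>
{
    using type = typename at_index<I - 1, Ts...>::type;
};

template <typename T, typename... Ts>
struct at_index<0, T, Ts...>
{
    using type = T;
};
#endif
} // namespace detail

// Same interface on both paths.
template <std::size_t I, typename... Ts>
using at_index_t = typename detail::at_index<I, Ts...>::type;

static_assert(std::is_same<at_index_t<1, int, float, char>, float>::value,
              "index 1 selects the second type");
```

Because both branches expose the same alias, callers do not need to know
which path was compiled.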

### Technical debt and housekeeping
- Unified all `namespace impl` to `namespace detail` across sequence.hpp
for consistency
- Removed dead comment block (orphaned `integer_sequence` alternative)
- Added defensive `static_assert(sizeof...(Is) > 0)` in
`sequence_map_inverse::build_inverse`
- Converted all multi-line Doxygen blocks from `///` to `/** */` per
style guide
- Corrected `constexpr static` to `static constexpr` keyword ordering in
`static_array`
- Added blank line between `#pragma once` and first `#include` in
`static_array.hpp`
- Trimmed redundant 4-line comment on `sequence_gen_helper` to a
one-liner
- Moved `sequence_gen` Doxygen comment below `namespace detail` block so
it directly precedes the struct it documents
- Added Doxygen `@brief`/`@tparam`/`@pre` documentation for
`sequence_gen` and `sequence_map_inverse` public APIs
- Added `@brief` documentation to `static_array` explaining relationship
to `ck_tile::array`
- Added scope comment at `namespace detail` openings

**Note:** `private:`/`public:` access modifier indentation is enforced
at 4 spaces by `.clang-format`. The style guide calls for
left-alignment, but the formatter overrides this. Resolving it requires
a `.clang-format` config change; it is not addressable in code.

### `static_array` hardening (from develop's PR #4355)
- Added zero-length array guard (`T elems[N > 0 ? N : 1]`)
- Added `CK_TILE_HOST_DEVICE` annotations to `operator[]` and `size()`
- Added `#include "ck_tile/core/config.hpp"` (IWYU for
`CK_TILE_HOST_DEVICE`)

### Value
Combined with the `static_ford` changes, measured impact on
`test_gemm_pipeline_compv6`:
- **Frontend: -28.9%** (InstantiateFunction: 84,288 → 69,439)
- **Backend: -13.1%** (CodeGen Functions: 3,170 → 2,203)
- **Wall-clock: -16.3%** (611.6s → 512.2s)

### Files changed (4)
- `sequence.hpp`: Metafunction optimizations, namespace unification,
documentation, style fixes
- `static_array.hpp`: Zero-length guard, `CK_TILE_HOST_DEVICE`,
documentation, style fixes
- `test_sequence.cpp`: 50 unit tests with runtime `EXPECT_EQ` assertions
(new file)
- `CMakeLists.txt`: Register new test target

## Test plan
- [x] 50 runtime unit tests covering all optimized and pre-existing
sequence APIs
- [x] Edge cases: empty sequences, single-element, larger sizes (N=8),
negative values, non-trivial init values
- [x] Both functor signatures tested (`operator()(index_t)` and
`operator()(number<I>)`)
- [x] Both scan reducers (`plus`, `multiplies`) with forward, reverse,
inclusive, and exclusive directions
- [x] Exclusive scan: sum, product, single, empty, non-zero init
- [x] Prefix sum: N+1 output verification, single, empty
- [x] Permutation round-trip verification for `sequence_map_inverse`
- [x] Full sequence public API coverage: modify, gen, uniform_gen, scans
(inclusive, exclusive, prefix sum), map_inverse, make_index_sequence,
size/sum/product, push/pop, reverse, extract, merge, arithmetic
operators, equality, transform
- [x] Portable `__type_pack_element` fallback tested implicitly (same
`at_index_t` interface)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2026-03-11 14:24:54 -06:00
Anton Gorenko
c90d46e57d [CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368)
## Motivation

Add support for microscaling types (mxfp8 and mxfp4) in the fwd qr pipeline.

## Technical Details

Microscaling is used when the quant scale mode is
`BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are
fp8/bf8/fp4.

Supported features:
* only "qr" pipeline is implemented
* hdim 128 and 256 (smaller hdim are not possible due to restrictions of
"qr" pipeline, but they can be computed using instances with padding)
 * both 32x32x64 and 16x16x128 scale MFMAs are supported
 * Q and K scales are applied in hdim, V scales - in seqlen dimension
 * column-major V only
 * batch and group mode
 * bias, Alibi (tested but no instances by default, just like fp8)
 * masking etc.
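
For readers unfamiliar with microscaling, the sketch below shows how
block scales are applied in general. It assumes the OCP Microscaling (MX)
convention of one shared E8M0 scale per block of 32 elements; this PR
does not spell out those details, and `e8m0_to_float` and
`apply_block_scales` are hypothetical helpers, not CK Tile functions. The
elements are taken as already decoded to `float` to keep the example
short.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Decode an E8M0 scale: an 8-bit biased exponent with no sign or mantissa.
// The value is 2^(e - 127); 0xFF is reserved for NaN.
inline float e8m0_to_float(std::uint8_t e)
{
    if(e == 0xFF)
        return std::numeric_limits<float>::quiet_NaN();
    return std::ldexp(1.0f, static_cast<int>(e) - 127);
}

// Multiply each element by the scale shared by its block (assumed block size 32).
inline std::vector<float> apply_block_scales(const std::vector<float>& elems,
                                             const std::vector<std::uint8_t>& scales,
                                             std::size_t block = 32)
{
    std::vector<float> out(elems.size());
    for(std::size_t i = 0; i < elems.size(); ++i)
        out[i] = elems[i] * e8m0_to_float(scales[i / block]);
    return out;
}
```

In the kernel the scaling is done by the scaled MFMA instructions rather
than by a separate pass; the loop only shows which elements share a
scale along the scaled dimension (hdim for Q and K, seqlen for V).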

Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008

## Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

## Test Result

The tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-11 09:59:50 +00:00
Sami Remes
600d778482 [CK_TILE] MX GEMM non-preshuffled RCR layout (#4594)
## Motivation

Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled
layouts using async pipeline.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2026-03-10 20:12:05 +00:00
Hosang
4e031ebd3c [CK_TILE] Update gfx11 FMHA forward kernel configs (#5088)
## Motivation
Tune gfx11 FMHA codegen to recover performance, mainly for PSSK (padded
seqlen_q/k) cases.
This tuning is based on heuristic search and improves performance in
most tested shapes.
Performance should be evaluated on top of
[`ROCm/rocm-libraries#5018`](https://github.com/ROCm/rocm-libraries/pull/5018)
(required baseline).

## Technical Details

- Updated gfx11 codegen heuristic choices for tile size and occupancy.
- Updated gfx11 pipeline selection:
  - Disabled the `npad` (`f,f,f,f`) qr entry because it was consistently
    slower than the `pssk` (`t,t,f,f`) path, and kept `pssk` enabled so
    npad cases are dispatched to the faster kernel path.
- Kept gfx12 unchanged: with PSSK support from
  [`ROCm/rocm-libraries#4957`](https://github.com/ROCm/rocm-libraries/pull/4957),
  existing gfx12 config is already sufficient.
- Tuning rationale:
  - In some cases, higher `kBlockPerCu` lowers register pressure.
  - On RDNA, this generally aligns with better performance when
    `waves_per_eu >= 6`.

## Test Plan
- test_ck_tile_fmha
- tile_example_fmha_fwd: tested on gfx1100 and gfx1151 with
  `./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}`

## Test Result
- TFLOPs by sequence length (target: `gfx1100`, layout: `bhsd`)
- mode: batch / VGPR usage: 225 vs 214

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 74.10 | 71.97 | 0.97x
4096 | 66.26 | 77.79 | 1.17x
8192 | 68.18 | 75.88 | 1.11x
12288 | 68.47 | 80.44 | 1.17x
16384 | 59.54 | 79.66 | 1.34x
20480 | 55.78 | 77.91 | 1.40x
24576 | 55.08 | 77.47 | 1.41x
27280 | 47.45 | 77.16 | 1.63x

- mode: group / VGPR usage: 256 vs 214

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 71.47 | 70.6 | 0.99x
4096 | 64.74 | 77.06 | 1.19x
8192 | 64.68 | 75.47 | 1.17x
12288 | 66.43 | 79.95 | 1.20x
16384 | 56.02 | 79.73 | 1.42x
20480 | 50.21 | 78.15 | 1.56x
24576 | 47.29 | 77.53 | 1.64x
27280 | 46.13 | 77.04 | 1.67x

- TFLOPs by sequence length (target: `gfx1151`, layout: `bshd`)
- mode: batch / VGPR usage: 225 vs 223

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 26.85 | 29.17 | 1.09x
4096 | 24.75 | 26.01 | 1.05x
8192 | 25.24 | 25.50 | 1.01x
12288 | 25.18 | 25.00 | 0.99x
16384 | 24.79 | 25.91 | 1.05x
20480 | 25.56 | 25.24 | 0.99x
24576 | 25.13 | 26.20 | 1.04x
27280 | 10.78 | 26.35 | 2.44x

- mode: group / VGPR usage: 256 vs 229

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 27.44 | 26.71 | 0.97x
4096 | 21.89 | 23.09 | 1.05x
8192 | 22.85 | 24.49 | 1.07x
12288 | 24.33 | 24.42 | 1.00x
16384 | 20.05 | 24.98 | 1.24x
20480 | 14.70 | 25.15 | 1.71x
24576 | 11.30 | 26.31 | 2.33x
27280 | 10.10 | 26.32 | 2.61x


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-10 09:46:41 -07:00