Commit Graph

3993 Commits

Author SHA1 Message Date
Bartłomiej Kocot
ce4525e82b [CK][CK Tile] Fix kbatch check in grouped conv and gemm kernels (#5555)
## Motivation

Fix kbatch check in grouped conv and gemm kernels, allow tails for
kbatch.

## Technical Details

Round up K / Kperxdl and divide it by Kbatch to allow tail for K.

## Test Plan

test_grouped_convnd_bwd_weight_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-21 23:55:24 +01:00
Emily Martins
59e6fc67d8 [CK_TILE] Prune Stream-K Tile Engine Tests (#5625)
## Motivation

Stream-K tile engine tests are causing issues for build time. While we
work on a more permanent solution, these changes prune the Stream-K test
instances to help reduce the build time burden.

## Technical Details

The Stream-K team recently transitioned to using CK Tile's tile engine
infrastructure for our smoke tests. However, since tile engine creates
an individual target per kernel instance, we've found that the tile
engine tests are increasing build times. Our team is currently working
to convert our existing tile engine tests back to basic gtests. While
this work takes place, we are temporarily pruning the existing Stream-K
tile engine test instances to help reduce the build time burden.

## Test Plan

Ran the pruned test set on all gfx90a, gfx942, and gfx950.

## Test Result

All tests pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 13:30:34 -07:00
andrew clark
ba959369e3 Improved CI infrastructure failure detection (#5464)
## Motivation

This PR re-enables CI infrastructure failure detection and notification,
which had been disabled due to performance issues caused by loading
large build logs (~80k lines) into memory for pattern scanning. The goal
is to reliably detect known infrastructure failures (GPU errors, Docker
authentication issues, disk space errors, etc.) and send actionable
Teams notifications without hanging on large logs.

## Technical Details

- Replaced full build log loading and Groovy-based pattern scanning with
a streaming wget | grep -E pipe. grep scans natively so the full log is
never loaded into Groovy, resolving the hang on large logs.
- Combined all failure patterns into a single grep -E call to avoid
multiple log fetches.
- The node name is now tracked with the observed failure.
- Added a new failure pattern for device's running out of space.

## Test Plan

- Forced failures in the "Determine CI Execution" stage with all 9
failure patterns echoed to the build log.
- Simulated large log sizes (~80k lines of dummy output) to validate
pattern detection and node name extraction at realistic log scales,
including patterns placed both before and after large blocks of dummy
output.

## Test Result

All 9 failure patterns detected correctly. Teams notifications sent with
accurate log context, node name, and job links. No hangs observed on 80k
line simulated logs.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 19:16:58 +00:00
Jobbins
8892f83890 add self healing to ref repo (#5630)
## Motivation

Check for when mirror repo gets corrupted in CI

## Technical Details

We detect broken ref objects and rebuild the local mirror in that case
of corruption

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 09:42:53 -07:00
Bartłomiej Kocot
5db7b220d5 [CK][CK Tile] Improve access for merged groups and remove modulo from xor (#5334)
## Motivation

[CK][CK Tile] Improve access for merged groups and remove modulo from
xor

## Technical Details

- add template parameter to xor if modulo is needed. We don't need
modulo for merged groups
- use access by m for merged groups for a tensor
- 
## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 15:45:45 +00:00
Kiefer van Teutem
e8f9bb0c19 [CK_Tile] Refactor amdgcn_mma policy structs (#5272)
## Motivation
The point of this MR is to update the intrinsic layout parameters to
simplify them and make them more clear and flexible. Also, a number of
simple refactors were performed to reduce boilerplate and code
duplication.

## Technical Details
In CK Tile and old CK, the full set of information available in the
intrinsic wrappers, for WMMA and MFMA combined, would be something like:

```
// Basic info
using ADataType = void;
using BDataType = void;
using CDataType = void;

using AVecType = ext_vector_t<ADataType, 0>;
using BVecType = ext_vector_t<BDataType, 0>;
using CVecType = ext_vector_t<CDataType, 0>;

// Fragment sizes
static constexpr index_t kM;
static constexpr index_t kN;
static constexpr index_t kK;

// Layout parameters
static constexpr index_t kAMBlock;
static constexpr index_t kBNBlock;

static constexpr index_t kRepeat;
static constexpr index_t kAMLane;
static constexpr index_t kBNLane;
static constexpr index_t kABK0PerLane;
static constexpr index_t kABKLane;
static constexpr index_t kABK1PerLane;

static constexpr index_t kCMLane;
static constexpr index_t kCNLane;
static constexpr index_t kCM0PerLane;
static constexpr index_t kCM1PerLane;

using kABPs2RHssMajor = sequence<2, 1>;
using kABPs2RHssMinor = sequence<1, 0>;
using kABYs2RHsMajor  = sequence<2, 2>;
using kABYs2RHsMinor  = sequence<0, 2>;

using kCPs2RHssMajor = sequence<1, 2>;
using kCPs2RHssMinor = sequence<1, 0>;
using kCYs2RHsMajor  = sequence<1, 1>;
using kCYs2RHsMinor  = sequence<0, 2>;

using kCTPs2RHssMajor = sequence<2, 1>;
using kCTPs2RHssMinor = sequence<1, 0>;
using kCTYs2RHsMajor  = sequence<2, 2>;
using kCTYs2RHsMinor  = sequence<0, 2>;   
 ```
Note that on top of the intrinsic sizes, we have 12 layout parameters. I have reduced this in the new design to:

```
// Basic info
using ADataType = void;
using BDataType = void;
using CDataType = void;

// Fragment sizes
static constexpr index_t kM;
static constexpr index_t kN;
static constexpr index_t kK;

// Layout parameters
static constexpr index_t kABKPerLane; // K2 * K0, Always the same, even
for diff A / B layouts
static constexpr index_t kAKNumAccess; // K2
static constexpr index_t kARepeat; // Used for RDNA3 repeated inputs and
CDNA block hiding.
static constexpr index_t kBKNumAccess; // K2
static constexpr index_t kBRepeat; // Used for RDNA3 repeated inputs and
CDNA block hiding.
static constexpr index_t kCMPerLane;   // M2 * M0
static constexpr index_t kCMNumAccess; // M2

// Derived properties
using AVecType = ext_vector_t<ADataType, 0>;
using BVecType = ext_vector_t<BDataType, 0>;
using CVecType = ext_vector_t<CDataType, 0>;
```

Note that there are now only 7 layout parameters and no more dimensionality orderings. Believe it or not these 7 parameters are more general than the original 12, and can handle intrinsic and mid-level features that are currently awkward in CK Tile, like dealing with AttrNumAccess, different A / B layouts, more general block-hiding (currently very limited in CK tile), and future arch features.

Furthermore, the A, B and C vec types are now derived directly from the layout parameters to ensure internal consistency.

I added a detailed explanation of the new params in terms of register mappings at the top of amgcn_mma.hpp

Other refactorings I did in this MR:

- Make an amdgcn_mma_base struct to drastically reduce code duplication and potential bugs. Should also make auto-generating the amd_gcn specializations much easier.
- Simplify the MmaOpTraits significantly by only including those parameters that are not directly gettable from the MmaOp itself. This removes duplicated variables and simplifies higher level code.
- Remove overloaded "Block" term for intrinsic dimensions, and replace by "Frag" instead. Some spots were already using the term "Frag" for combined intrinsics, in which case I changed that term to "Chunk" instead.
- Remove some tests that had become somewhat pointless (setting variables and then checking their values immediately).

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 09:07:00 -06:00
Bartłomiej Kocot
0a3229ea22 [CK][CK Tile] Move grouped conv cpp instances to build dir (#5609)
## Motivation

Move grouped conv .cpp instances to build dir. Fix generate instances
script.

## Technical Details

Avoid CI problem when instances in experimental directory are not
removed

## Test Plan

test_grouped_convnd_*_tile

## Test Result

Pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 13:19:31 +00:00
joyeamd
d2b129f56b Ck/joye/revert oob check (#5640)
## Motivation

fix ck_tile's oob check. 


## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 12:30:08 +00:00
arai713
121fa1a503 [CK_TILE] Rename Stream-K grid function (#4795)
## Motivation
This PR introduces a change in the name of the get_grid function in the
Stream-K TilePartitioner to avoid confusion with a similarly named
method. In the Stream-K TilePartitioner, there is get_grid() which
returns num_cu*occupancy and there is grid_size() which returns the grid
size used to launch the kernel. In this PR, we change get_grid() to be
get_max_active_wgs() to better reflect what the function returns and not
confuse it with grid_size().

## Technical Details
Initially in the Stream-K TilePartitioner we had get_grid() which
returned grid_. We are renaming get_grid() to get_max_active_wgs() and
grid_ to max_active_wgs_ internally, while keeping grid_size() the same.
The parameter, grid, for the Stream-K TilePartitioner remains the same
to maintain consistency with the rest of the Stream-K API.

## Test Plan
Validated using the test suite that is already present.

## Test Result
All tests passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 03:27:44 -06:00
yinglu
b973cc43d3 [CK]fix: remove redundant structured sparsity check in run_gemm_example.inc (#5612)
## Motivation

This issue if found via
https://github.com/ROCm/rocm-libraries/pull/4302#discussion_r2958603418
and is introduced via https://github.com/ROCm/rocm-libraries/pull/5323.

## Technical Details

The outer `if` and inner `if constexpr` both checked
GemmConfig::UseStructuredSparsity. Merged into a single `if constexpr`
since both preshuffle and UseStructuredSparsity are compile-time
constants.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 09:22:14 +01:00
Sami Remes
a699df9fdc [CK_TILE] Enable MXFP6 for MX GEMM op (#5095)
## Motivation

Add support for MXFP6 in the MX GEMM op in CK-Tile.

Depends on https://github.com/ROCm/rocm-libraries/pull/4594

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-19 18:07:47 -07:00
Yaswanth Raparti
a61238b69f [CK][CK TILE] Fix smart-build to run install target for client examples (#5614)
How ninja install works:
- Builds library dependencies (device_operations, etc.)
- Installs them to CMAKE_INSTALL_PREFIX
- Skips building test executables (not install dependencies)

Affected stages (8):
- gfx942/gfx950/gfx908/gfx90a CK Client Examples
- gfx10-1/gfx10-3/gfx11/gfx12 CK Client Examples

## Motivation

Problem: When smart-build is enabled (runAllUnitTests=false), the build
step is skipped entirely. This causes client example stages to fail
because they depend on the CK library being installed to ../install.

Error seen:
  Target "client_gemm" links to:
    composable_kernel::device_other_operations
  but the target was not found.

## Technical Details
Root cause: Line 712 only checked runAllUnitTests, so when building with
config_targets="install", the install target was never built, leaving
the install directory empty.

Fix: Added condition to always build when config_targets contains
'install'. The install target automatically builds its dependencies (the
CK libraries) but skips building tests, which aligns with smart-build
philosophy.

## Test Plan

Should be tested on CI 

## Test Result

Should be tested on CI

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-03-19 22:00:29 +00:00
Bartłomiej Kocot
aa5ae1c005 [CK][CK Tile] Fix dram step for KM/KN layouts in V1 pipeline (#5470)
## Motivation

Fix v1 pipeline for KM/KN layouts by passing correct step for dram tile
window.

## Technical Details

- Fix dram step for KM/KN layouts in V1 pipeline
- Disable instances which use more threads than warp size in continous
dim (not supported in ck tile yet)
- Use 1x1 specialization for explicit gemm
- Use two stage for vectorsize =1 and sizeof(datatype) ==2
- remove not needed check sinze GetVectorSizeA/B check if vector size is
fixed

## Test Plan

test_grouped_convnd_bwd_weight_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-966
2026-03-19 11:59:44 +00:00
assistant-librarian[bot]
39bc8453c6 [CK_TILE] add tf32 support (#4302)
## Proposed changes

TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in
CK_TILE on gfx942 and gfx950.

## Checklist

Please put an into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run  on all changed files
- [ ] Any dependent changes have been merged

## Discussion



---
🔁 Imported from
[ROCm/composable_kernel#3538](https://github.com/ROCm/composable_kernel/pull/3538)
🧑‍💻 Originally authored by @yingluAMD

---------

Co-authored-by: yingluAMD <Yingmao.Lu@amd.com>
Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-03-19 10:17:20 +01:00
Yaswanth Raparti
1333922f04 [CK] [CK_TILE] Improve build and test time of CI with smart dependency parser (#5249)
## Motivation

Existing dependency parser needs full build of tests to determine which
tests are affected by code changes in a PR. This still takes 2-4 hours
for building the tests which slows down the CI as the number of tests
grow. To resolve this issue we implemented a smart dependency parser
which uses CMake Configure to parse dependencies and build only the
affected test cases. We have ensured that two approaches are available
1) CMake pre-build analysis for each PR to ensure fast build and test.
2) Ninja post-build analysis to enable full build for nightly tests.

## Technical Details

```bash
### 1. Configure the project with CMake
cmake -G Ninja -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ..

### 2. Analyze dependencies (no build required!)
python3 ../script/dependency-parser/main.py cmake-parse compile_commands.json build.ninja \
  --workspace-root .. --output cmake_dependency_mapping.json --parallel 8

### 3. Find tests affected by changes
python3 ../script/dependency-parser/main.py select cmake_dependency_mapping.json origin/develop \
  HEAD --test-prefix --output tests_to_run.json

### 4. Build only affected tests
ninja $(jq -r '.executables[]' tests_to_run.json | tr '\n' ' ')

### 5. Run affected tests
ctest -R "$(jq -r '.regex' tests_to_run.json)"
```

### Jenkins Integration
- Added `buildMode` to jenkinsfile to integrate both `selective` and
`full` build methods

### Known Limitations

### 1. Build-Time Generated Headers (HIGH RISK)

**Problem:** Files generated during the build process (e.g., via
`add_custom_command`) cannot be analyzed before building.

**Example:**
```cmake
add_custom_command(
  OUTPUT ${CMAKE_BINARY_DIR}/generated/config.hpp
  COMMAND generate_config.sh
  DEPENDS template.hpp.in
)
```

**Impact:** If a source file includes `generated/config.hpp`, the
dependency won't be detected until after building.

**Mitigation:**
- CK analysis shows **no generated headers** currently used
- If generated headers are added in the future, they must be built first
- Recommendation: Generate headers in CMake configure phase (not build
phase) when possible


## Test Plan
**1. Modified Files:**
```
include/ck_tile/ops/common.hpp
include/ck_tile/ops/gemm.hpp
include/ck_tile/ops/gemm/warp/warp_gemm.hpp
```
**2. Compare tests selected between `build.ninja` and `cmake-parse`
methods**


## Test Result
- 1. The test completed in 5-6 minutes finding about 8000+ executables
that should be built.
- 2. We selected a commit 5ccc1387ea which resulted in same 7 tests with
both legacy and new methods.
- 


PR | Legacy tests | Smart tests | Notes
-- | -- | -- | --
5261 | 453 | 455 | Only 2 tests (test_amdgcn_mma and
test_amdgcn_sparse_mma)
5168 | 0 | 0 | Changes in   dispatcher only. No CK tests invoked.
5249 | 0 | 0 | Changes to   dependency parser. No CK tests invoked
5260 | 0 | 0 | Changes in   dispatcher only. No CK tests invoked.
5174 | 1 | 1 | One test from FMHA   affected by this PR in both cases
5383 | 0 | 0 | Changes are only in benchmark files. Did not trigger any
tests
5445 | 1 | 1 | Changes are only to tests/ck_tile/gemm_streamk. Only
triggered one streamk test in both cases.
5454 | 3 | 3 | Both methods identified same test_grouped_conv_bwd tests
5427 | 234 | 234 | Core infrastructure header changes. Detected exactly
same tests
5388 | 85 | 85 | modifies warp-level GEMM operations (warp_gemm.hpp,
warp_gemm_dispatcher.hpp). Correctly identified all the streamK gemm
tests







## Submission Checklist

- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-03-19 05:30:19 +00:00
lalala-sh
b0df41c187 [CK] Fix MOE FP8 SplitK buffer descriptor OOB (#5086)
When SplitK is enabled, kernel entry shifts A/B/AScale/BScale base
pointers by SplitKBatchOffset, but make_dynamic_buffer element spaces
are still based on full K dimension. This causes hardware buffer
resource descriptors to extend beyond the actual tensor allocation,
leading to GPU memory access faults when the tensor happens to be placed
at the end of an allocated memory pool region.

Fix by subtracting the split offset from each buffer's element space in
both Run() (v1 pipeline) and Run_2Lds() (v2/v3 pipeline), so the buffer
descriptor range [shifted_base, shifted_base + reduced_space) exactly
covers the valid allocation.

Also refactor SplitKBatchOffset to accept const Problem& (instead of
Argument&) and add a default constructor, enabling direct reuse in
Run/Run_2Lds without duplicating offset calculation logic.

Made-with: Cursor

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Yi DING <yi.ding@amd.com>
2026-03-19 10:41:53 +08:00
Christopher Millette
4fdcfab53d [CK] Replace nested static_for with static_ford to reduce device IR function emissions [1B] (#5031)
## Summary

### Rationale
CK's GPU kernels are among the slowest files in the ROCm build, with a
single translation
unit taking up to 10+ minutes. Profiling with `-ftime-trace` identified
nested `static_for`
loops as the root cause: each nesting level multiplies the number of
unique lambda IR
functions the compiler must process. A 2-level nest of `static_for<0, M,
1>` /
`static_for<0, N, 1>` produces M×N unique lambda types. With typical
GEMM dimensions
(M=16, N=4), a single nest generates 64 unique functions — and these
nests appear hundreds
of times across the codebase.

The LLVM backend's CGSCC (Call Graph Strongly Connected Components)
framework processes
each function independently, so reducing function count directly reduces
backend time.

### What changed
393 nested compile-time loop patterns across 73 files are converted to
`static_ford`, which
flattens multi-dimensional compile-time iteration into a single
`static_for` with index
decomposition. This eliminates 994 `static_for` nesting levels (42%
reduction).

Three pattern categories were converted:
- **Category A**: `static_for` wrapping `static_ford` — fold outer
dimension into ford
- **Category B**: nested `static_ford` — merge into single
higher-dimensional ford
- **Category C**: nested `static_for` chains — convert to single
`static_ford`

### Verification

**ASM equivalence: PASS — 51/51 device assembly files identical (gfx942
+ gfx1100)**

| Architecture | Files compared | Largest file | Result |
|---|---|---|---|
| gfx942 | 36 | 386,685 lines | ALL MATCH |
| gfx1100 | 15 | 47,769 lines | ALL MATCH |

**Build time (Wilcoxon signed-rank test, 7 paired trials):**

| Target | Pre (s) | Post (s) | Delta | p-value |
|---|---|---|---|---|
| bscale | 169 | 152 | **-9.8%** | 0.016 \* |
| xdl_v1234 | 207 | 194 | **-6.6%** | 0.016 \* |
| preshuffle | 275 | 264 | **-3.9%** | 0.016 \* |
| xdl_base | 142 | 137 | **-3.2%** | 0.031 \* |

**IR function counts (device backend, gfx942):**

| Target | InstFunc Δ | CodeGen Δ | Compiler Δ |
|---|---|---|---|
| bscale | -13,043 (-8.2%) | -2,103 (-3.5%) | -10.7% |
| xdl_v1234 | -9,431 (-5.7%) | +59 (+0.1%) | -5.2% |
| xdl_base | -6,162 (-4.9%) | -1,141 (-2.5%) | -2.2% |
| xdl_old | -3,234 (-3.7%) | -963 (-8.7%) | -3.3% |

### Value
- **994 fewer `static_for` nesting levels** (-42%) across 73 files
- **393 `static_ford` sites** created (from 4 pre-existing)
- **Up to 9.8% compile-time reduction** on representative targets
(statistically significant, p < 0.05)
- **Up to 13K fewer IR function instantiations** per translation unit
- Net -849 LOC from reduced indentation
- **Zero ASM changes** — identical device code output verified on gfx942
and gfx1100
- All scheduling barriers, `if constexpr` guards, and MFMA/WMMA
accumulation order preserved

### Files changed (73)
- `block/`: 47 files (GEMM pipelines — xdlops, wmma, moe, preshuffle,
blockscale variants)
- `grid/`: 20 files (softmax, normalization, reduction, attention,
layernorm)
- `thread/`: 5 files (tensor slice transfer, contraction, GEMM dlops,
reduction)
- `tensor_description/`: 1 file (tensor_adaptor)

## Test plan
- [x] `static_ford` tested with 21 unit tests in
`test/util/unit_ford.cpp`
  (1D-4D, custom orders, compile-time verification)
- [x] All conversions preserve iteration order, `block_sync_lds()`
placement,
  `if constexpr` scheduling guards, and MFMA/WMMA accumulation order
- [x] ASM equivalence verified: 51 device `.s` files across gfx942 +
gfx1100
- [x] Build-time improvement statistically confirmed (Wilcoxon, p <
0.05, 4 targets)
- [x] IR function count reduction confirmed via `-ftime-trace` on 7
targets
- [x] Detection script reports 0 remaining safe patterns (180 blocked
with structural reasons)
- [x] Existing CI tests (GEMM, softmax, normalization, batch norm,
reduction,
  attention) exercise all converted code paths

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 08:45:22 -06:00
Thomas Ning
75357722b8 CK Tile MX GEMM Packing Improvement (#5323)
## Motivation

Reduce the scale loading size and also has better utilization of MFMA
scale selection.

## Technical Details

Add up the packing of mx scales.

## Test Plan

Use the existing test cases.

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
2026-03-17 11:57:32 -07:00
Hosang
b5894b3cbe [CK_TILE] Add LLC-aware FMHA head grouping and head-major scheduling on RDNA (#5018)
## Motivation
Long-sequence FMHA can become memory-bound when K/V working sets exceed
Infinity Cache (LLC), causing repeated DRAM traffic across heads.

This PR introduces LLC-aware launch ordering improvements for FMHA
forward, and it is currently enabled only on gfx11 and gfx12. The
approach is inspired by
[`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217),
adapted to CK’s kernel/runner structure and layout handling.

In this context, `bshd` is the layout used in Flash-Attention, while
`bhsd` is the default layout used by the CK Tile FMHA example.

## Technical Details
This PR adds two complementary strategies:

- For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware
head grouping:
  - Estimate LLC size (env override, KFD sysfs, or arch default).
  - Compute group size from K/V bytes per head vs LLC target.
- Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and
related tensors).

- For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit
launch-order adjustment:
  - Keep a single kernel launch.
- Reinterpret block linearization in `GetTileIndex` to make execution
head-major,
     improving temporal locality of per-head K/V reuse.

Additional integration updates:
- Propagate `num_head_q_total` and `head_start` through FMHA args/kargs.
- Use global head indexing for dropout RNG stream mapping so grouped
launches keep
    deterministic/consistent dropout behavior.
- Keep fallback behavior unchanged when grouping is not beneficial or
disabled.

## Test Plan
- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`

## Test Result
- `test_ck_tile_fmha`: all tests passed.
- `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201,
and all of them show higher performance compared to the baseline. The
improvement is consistent, and performance is well maintained even at
long sequence lengths.

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128
-s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
- TFLOPs by sequence length target: gfx1100 layout: bhsd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 56.27 | 61.48 | 1.09x
4096 | 67.10 | 72.27 | 1.08x
8192 | 65.99 | 71.64 | 1.09x
12288 | 61.60 | 76.61 | 1.24x
16384 | 58.99 | 75.74 | 1.28x
20480 | 57.32 | 74.42 | 1.30x
24576 | 56.89 | 74.25 | 1.31x
27280 | 18.93 | 24.48 | 1.29x

- TFLOPs by sequence length target: gfx1201 layout: bshd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 66.79 | 65.90 | 0.99x
4096 | 85.90 | 86.80 | 1.01x
8192 | 77.06 | 90.29 | 1.17x
12288 | 58.36 | 88.98 | 1.52x
16384 | 52.12 | 88.88 | 1.71x
20480 | 48.11 | 88.42 | 1.84x
24576 | 47.12 | 89.07 | 1.89x
27280 | 49.05 | 50.31 | 1.03x

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 21:18:34 +00:00
Bartłomiej Kocot
1e1f3647f7 [CK][CK Tile] Grouped Convolution backward weight profiler flush cache (#5454)
## Motivation

Flush cache to get more stable results during profiling old ck and ck
tile.

## Technical Details

Flush cache before each kernel call and one more first run.

## Test Plan

test_grouped_conv_bwd_weight_tile

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-966

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-16 17:46:21 +00:00
lalala-sh
4f091cacd0 [CK] fix moe memset size which is bigger than alloc (#5225)
## Motivation
Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that
crashes split-K MOE GEMM with "HIP runtime error: invalid argument".
When KBatch > 1, the invoker zeroes the output buffer using arg.M *
arg.N as the byte count. However, arg.M is the padded sorted-token-id
length from MOE routing, which can be much larger than the actual output
allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write
beyond the buffer, and the silently-swallowed HIP error propagates to
the subsequent kernel launch via hipGetLastError().
This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset
matches the actual output size.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 17:30:07 +08:00
Enrico Degregori
a9a8a57786 [CK Tile] Eight Waves pipeline GEMM (#4964)
## Motivation

Eight waves pipeline was added for ABQuant. The goal of this PR is to
enable it also for GEMM

## Technical Details

Summary:
 - Block:
- Create block struct for GEMM using eight warps specific distribution
encodings
   - Use this block struct in ABQuant for encodings
 - Pipeline:
- Create impl pipeline for eight waves which can be used by GEMM and
ABQuant as base (and for AQuant and BQuant in the future)
- Create eight waves pipeline for GEMM (this can not be easily
integrated in the existing async pipeline)
 - Pipeline policy:
- Extract GEMM specific parts in the ABQuant policy to define GEMM
policy (then ABQuant use it as base and add Quant specific methods)
- Minor: naming was inconsistent between warp/wave, everything is now
referred to as eight waves

So overall we have:
- block struct directly used by GEMM -> ABQuant derived struct to
implement operator
- Impl base pipeline with general implementation -> GEMM and ABQuant
pipelines use it to avoid code duplication but still define their own
pipelines
- pipeline policy struct directly used by GEMM -> ABQuant derived policy
struct for Quant specific parts

## Test Plan

Added new tests for GEMM pipeline:
`test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950
supports it).

Note: K padding test is disabled for this pipeline because it's not
implemented yet

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 09:30:54 +01:00
Bartłomiej Kocot
6373f5df7b [CK][CK Tile] Grouped Convolution Backward Weight set of fixes (#5387)
## Motivation

Grouped Convolution Backward Weight split k fixes for CK tile kernels

## Technical Details

- get k batch from kargs to get deduced k batch
- multiply zeroing size by data type size
- disable v6 (producing a incorrect results)

## Test Plan

test_grouped_convnd_bwd_weight_tile

## Test Result

Pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Ville Pietilä <>
2026-03-13 10:18:19 -06:00
Yi DING
f6bfcad437 [CK_TILE] FMHA BWD Use Persistent Kernels in Deterministic Mode (#5174)
## Motivation
This PR enables a persistent-kernel execution path for FMHA backward
(dQ/dK/dV) in deterministic mode, adjusting how dQ accumulation is
split, stored, and converted back to final gradients.

## Technical Details
- Introduces a persistent-kernel grid mapping in deterministic mode and
updates split-count calculation accordingly.
- Extends kernel kargs to carry batch-related info needed for persistent
scheduling and dQ conversion.
- Refactors dQ store conditions and adds mask-type traits/utilities and
runner logging updates.

## Test Plan
- Jenkins
[base](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/10/pipeline)
- Jenkins
[AITER](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/12/pipeline)
- Jenkins
[FMHA](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/11/pipeline)
- local FA tests

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-13 14:13:32 +08:00
Ville Pietilä
5d2cbd1117 [CK_TILE, CK_BUILDER] Add two-stage bwd weight kernels to CK Tile profiler (#5237)
## Motivation

PR #4797 added CK Tile bwd weight kernels to the CK Profiler. The
two-stage kernels were not supported in the initial PR. This PR adds the
the missing bwd weight two-stage kernels to the CK Profiler.

## Technical Details

Extended the CK Tile conv builder factory to build also the elementwise
ops required for the two-stage kernels. Extended the CK Builder for CK
Tile instance to accept the two-stage flag as part of the algorithm
configuration.

## Test Plan

Added units tests for CK Builder that verify the two-stage kernel
construction.
  
## Test Result

If CI passes, the added unit tests are passing.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Ville Pietilä <>
2026-03-12 19:20:15 -06:00
dependabot[bot]
3fe871bb0c Bump tornado from 6.5.4 to 6.5.5 in /projects/composablekernel/docs/sphinx (#5376)
Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.5.4 to
6.5.5.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst">tornado's
changelog</a>.</em></p>
<blockquote>
<h1>Release notes</h1>
<p>.. toctree::
:maxdepth: 2</p>
<p>releases/v6.5.5
releases/v6.5.4
releases/v6.5.3
releases/v6.5.2
releases/v6.5.1
releases/v6.5.0
releases/v6.4.2
releases/v6.4.1
releases/v6.4.0
releases/v6.3.3
releases/v6.3.2
releases/v6.3.1
releases/v6.3.0
releases/v6.2.0
releases/v6.1.0
releases/v6.0.4
releases/v6.0.3
releases/v6.0.2
releases/v6.0.1
releases/v6.0.0
releases/v5.1.1
releases/v5.1.0
releases/v5.0.2
releases/v5.0.1
releases/v5.0.0
releases/v4.5.3
releases/v4.5.2
releases/v4.5.1
releases/v4.5.0
releases/v4.4.3
releases/v4.4.2
releases/v4.4.1
releases/v4.4.0
releases/v4.3.0
releases/v4.2.1
releases/v4.2.0
releases/v4.1.0
releases/v4.0.2
releases/v4.0.1
releases/v4.0.0
releases/v3.2.2
releases/v3.2.1
releases/v3.2.0
releases/v3.1.1</p>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="7d6465056c"><code>7d64650</code></a>
Merge pull request <a
href="https://redirect.github.com/tornadoweb/tornado/issues/3586">#3586</a>
from bdarnell/update-cibw</li>
<li><a
href="d05d59b808"><code>d05d59b</code></a>
build: Bump cibuildwheel to 3.4.0</li>
<li><a
href="c2f46732b0"><code>c2f4673</code></a>
Merge pull request <a
href="https://redirect.github.com/tornadoweb/tornado/issues/3585">#3585</a>
from bdarnell/release-655</li>
<li><a
href="e5f1aa4b6f"><code>e5f1aa4</code></a>
Release notes and version bump for v6.5.5</li>
<li><a
href="78a046f99f"><code>78a046f</code></a>
httputil: Add CRLF to _FORBIDDEN_HEADER_CHARS_RE</li>
<li><a
href="24a2d96ea1"><code>24a2d96</code></a>
web: Validate characters in all cookie attributes.</li>
<li><a
href="119a195e29"><code>119a195</code></a>
httputil: Add limits on multipart form data parsing</li>
<li>See full diff in <a
href="https://github.com/tornadoweb/tornado/compare/v6.5.4...v6.5.5">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=tornado&package-manager=pip&previous-version=6.5.4&new-version=6.5.5)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/ROCm/rocm-libraries/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-12 14:22:13 -04:00
Adam Osewski
7a12ecc762 [CK TILE] Skip work if any of Grouped GEMM groups M/N/K are zero. (#5050)
## Motivation

It's common in MoE workloads that some experts receive zero tokens,
which would result in some of the dimensions equal to zero. Currently we
handle such case only for non-persistent kernels where we have all GEMMs
information beforehand on host - we validate this during creation of
kernel arguments. However for the "dynamic" input path (persistent
kernel) this information is not available before kernel launch. Thus we
have to validate this during kernel execution. The goal is to add this
validation.

## Technical Details

Skip work if any of Grouped GEMM groups M/N/K are zero for persistent
kernel path.

## Test Plan

Add unit-tests which cover "dynamic" inputs with zero dims for
persistent kernel execution path.

## Test Result

All tests pass.

## Submission Checklist

- [ x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 13:28:24 +00:00
Michal Kulikowski
3c2c294ffb [CK][Test] Moving device_op creation before data initialization.
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2026-03-12 09:47:56 +01:00
Michal Kulikowski
29c4f868ef [CK][Examples] Adding parameters for a couple of CK examples:
-gemm_add_add_mean_meansquare_xdl_fp16
-gemm_dl_quantization_int8
-gemm_xdl_bias_relu_quantization_int8
-gemm_xdl_quantization_int8

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2026-03-12 09:47:41 +01:00
chris-tsiaousis-hpc
9c1c4c9168 Changed the include order of the new WMMA/MFMA unification framework (#5241)
Those changes are to fix the include order and make header files
independent of one another. Also the `remod.py` sript has run and
changed the `grouped_convolution.hpp` and `core.hpp` files.

## Motivation

Some headers appear to depend on include order.
For example, when moving `#include "wmma/wmma.hpp"` in
[amdgcn_mma.hpp](https://github.com/ROCm/rocm-libraries/blob/develop/projects/composablekernel/include/ck_tile/core/arch/mma/amdgcn_mma.hpp)
later in the include list, it is causing compilation errors. Also the
pre-commit script `remod.py` is shuffling includes to be in alphabetical
order and is causing compilation issues.

Expected behaviour:
Headers should be independent of one another: no header should require
another to be included first. Each header should compile correctly on
its own.

## Test Plan

The CI (that runs `remod.py`) should compile.

## Test Result

Existing CI should compile and be green.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-03-12 09:26:58 +01:00
Aviral Goel
a124535133 ck_tile: add gtest unit tests for MX flatmm (gfx950) (#5082)
## Summary

- Add correctness unit tests for the MX-format flatmm kernel
(`example/ck_tile/18_flatmm/mxgemm`) under `test/ck_tile/flatmm/`
- Tests cover all five dtype combinations: FP4×FP4, FP8×FP8, FP6×FP6,
FP8×FP4, FP4×FP8
- Tests cover all four kernel dispatch paths (the `has_hot_loop` ×
`tail_num` product):
  - `has_hot_loop=false, tail=ODD` (K=256, num_loop=1)
  - `has_hot_loop=false, tail=EVEN` (K=512, num_loop=2)
  - `has_hot_loop=true, tail=ODD` (K=768, num_loop=3)
  - `has_hot_loop=true, tail=EVEN` (K=1024, num_loop=4)
- Remove unsupported `-split_k` CLI option from
`tile_example_mx_flatmm`; the pre-shuffled B layout is incompatible with
K-splitting and the option silently produced wrong results

## Changes

**New files (`test/ck_tile/flatmm/`):**
- `CMakeLists.txt` — builds 40 kernel instances as a shared OBJECT
library, links into 5 per-dtype test executables; forwards
`-DCK_TILE_USE_OCP_FP8` when `CK_USE_OCP_FP8` is ON
- `test_mx_flatmm_base.hpp` — base test fixture with
`run_test_with_validation(M, N, K, kbatch=1)`
- `test_mx_flatmm_fixtures.hpp` — concrete `TestMXFlatmm` typed test
class and type aliases
- `test_mx_flatmm_fp{4fp4,8fp8,6fp6,8fp4,4fp8}.cpp` — per-dtype
`TYPED_TEST_SUITE` files

**Modified files:**
- `example/ck_tile/18_flatmm/mxgemm/mx_flatmm_arch_traits.hpp` — moved
`preShuffleWeight` here (was in `mx_flatmm.cpp`) so it is includeable by
both the example and the tests
- `example/ck_tile/18_flatmm/mxgemm/mx_flatmm.cpp` / `run_mx_flatmm.inc`
— removed `-split_k` CLI arg, hardcoded `k_batch=1`, fixed `k_split`
formula, updated call sites after `preShuffleWeight` move
- `test/ck_tile/CMakeLists.txt` — added `add_subdirectory(flatmm)`

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-03-11 15:46:58 -07:00
Bartłomiej Kocot
86ac3213ed [CK][CK Tile] Improvements for grouped conv fwd tile profiling (#5114)
## Motivation

Improve profiling for grouped convolution forward for better comparison
between CK and CK Tile
## Technical Details

- Include preprocessing time for ck tile
- Add flush cache for conv fwd profiler
- Switch configs to builder reflect
- Add KPerXdl deduce
- Add non-grouped ported instances

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-786
2026-03-11 23:38:15 +01:00
Emily Martins
ddf49459ba [CK_TILE] Add the GEMM Memory pipeline to Stream-K tests (#5242)
## Motivation

We want to extend our Stream-K coverage to include other GEMM pipeline
since our current tests only test the CompV3 pipeline.

## Technical Details

All Stream-K unit tests currently only tests one pipeline: CompV3. These
changes extend the test support to also test the Memory pipeline. Future
work will add support for additional GEMM pipelines.

The major changes are as follows:
- **Remove of fp8 and bf8 extended tests for gfx90a**: gfx90a does not
have native support for fp8 and bf8 and emulate the behavior with fp32
mfma instruction sizes. We've observed extremely long compile times for
fp8 and bf8 on gfx90a (exceeding 15 minutes), hence we've opted to
disable these tests.
- **Add the memory pipeline to the Stream-K tile engine tests**: Now our
smoke tests covers compv3 and memory pipelines.
- **Add the memory pipeline to the Stream-K extended tests**: These
changes modify the test kernel types to include the appropriate
pipeline. Each pipeline is contained within a separate kernel type to
help avoid large increases in build time.

## Test Plan

- Ran existing and added tests on all architectures.

## Test Result

- All local tests pass. 

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-11 15:09:17 -06:00
Christopher Millette
4947624915 [CK_TILE] Optimize ck_tile::sequence to reduce template instantiation depth [2A] (#5028)
## Summary

### Rationale
`ck_tile::sequence` is the most fundamental metaprogramming type in
ck_tile — it underpins
tensor dimensions, strides, loop bounds, and index calculations. Six of
its metafunctions
use recursive template instantiation, producing O(N) to O(N²)
intermediate types that
the compiler must process. When these are used inside deeply nested GEMM
pipelines with
large dimension counts, the cumulative instantiation overhead becomes a
significant
contributor to frontend compile time.

Measurements on `test_gemm_pipeline_compv6` show 84,288
`InstantiateFunction` calls in
the frontend alone. Reducing template instantiation depth in these core
utilities has a
multiplicative effect because they are called from hundreds of sites.

### What changed

| Metafunction | Before | After |
|---|---|---|
| `sequence::modify` | O(N) recursive split/merge | O(1) pack expansion
|
| `sequence_gen` | O(log N) recursive binary split | O(1) via
`__make_integer_seq` |
| `uniform_sequence_gen` | Delegates to `sequence_gen` | O(1) via
`__make_integer_seq` |
| `sequence_reverse_inclusive_scan` | O(N) recursive | O(1) constexpr
for-loop + pack expansion |
| `sequence_inclusive_scan` | Computed via reverse + flip | O(1)
constexpr for-loop (unified impl) |
| `sequence_exclusive_scan` | O(N) recursive merge chain | O(1)
constexpr for-loop + pack expansion |
| `sequence_map_inverse` | O(N²) recursive modify calls | O(1) constexpr
for-loop + pack expansion |

Supporting changes:
- Portable `__type_pack_element` fallback with `__has_builtin` guard
(hipRTC-safe, no `<tuple>` dependency)
- Renamed reserved `__integer_sequence` to `integer_sequence_wrapper`
- Adopted `static_array` from develop (PR #4355) for constexpr
computation
- Unified forward and reverse inclusive scan into a single
`sequence_inclusive_scan_impl` with `bool Reverse` template parameter
- Added `sequence_inclusive_scan` struct (new public API for forward
scan direction)
- Replaced recursive `sequence_exclusive_scan` (3 template
specializations) with `sequence_exclusive_scan_impl` using the same
constexpr for-loop pattern as inclusive scan
- Rewired `exclusive_scan_sequence` and `prefix_sum_sequence` to use new
impl
- Added `CK_TILE_HOST_DEVICE` to `exclusive_scan_sequence` and
`prefix_sum_sequence` to match sibling scan function annotations

### Technical debt and housekeeping
- Unified all `namespace impl` to `namespace detail` across sequence.hpp
for consistency
- Removed dead comment block (orphaned `integer_sequence` alternative)
- Added defensive `static_assert(sizeof...(Is) > 0)` in
`sequence_map_inverse::build_inverse`
- Converted all multi-line Doxygen blocks from `///` to `/** */` per
style guide
- Corrected `constexpr static` to `static constexpr` keyword ordering in
`static_array`
- Added blank line between `#pragma once` and first `#include` in
`static_array.hpp`
- Trimmed redundant 4-line comment on `sequence_gen_helper` to a
one-liner
- Moved `sequence_gen` Doxygen comment below `namespace detail` block so
it directly precedes the struct it documents
- Added Doxygen `@brief`/`@tparam`/`@pre` documentation for
`sequence_gen` and `sequence_map_inverse` public APIs
- Added `@brief` documentation to `static_array` explaining relationship
to `ck_tile::array`
- Added scope comment at `namespace detail` openings

**Note:** `private:`/`public:` access modifier indentation is enforced
at 4 spaces by
`.clang-format`. The style guide calls for left-alignment, but the
formatter overrides
this. Requires a `.clang-format` config change to resolve — not
addressable in code.

### `static_array` hardening (from develop's PR #4355)
- Added zero-length array guard (`T elems[N > 0 ? N : 1]`)
- Added `CK_TILE_HOST_DEVICE` annotations to `operator[]` and `size()`
- Added `#include "ck_tile/core/config.hpp"` (IWYU for
`CK_TILE_HOST_DEVICE`)

### Value
Combined with the `static_ford` changes, measured impact on
`test_gemm_pipeline_compv6`:
- **Frontend: -28.9%** (InstantiateFunction: 84,288 → 69,439)
- **Backend: -13.1%** (CodeGen Functions: 3,170 → 2,203)
- **Wall-clock: -16.3%** (611.6s → 512.2s)

### Files changed (4)
- `sequence.hpp`: Metafunction optimizations, namespace unification,
documentation, style fixes
- `static_array.hpp`: Zero-length guard, `CK_TILE_HOST_DEVICE`,
documentation, style fixes
- `test_sequence.cpp`: 50 unit tests with runtime `EXPECT_EQ` assertions
(new file)
- `CMakeLists.txt`: Register new test target

## Test plan
- [x] 50 runtime unit tests covering all optimized and pre-existing
sequence APIs
- [x] Edge cases: empty sequences, single-element, larger sizes (N=8),
negative values, non-trivial init values
- [x] Both functor signatures tested (`operator()(index_t)` and
`operator()(number<I>)`)
- [x] Both scan reducers (`plus`, `multiplies`) with forward, reverse,
inclusive, and exclusive directions
- [x] Exclusive scan: sum, product, single, empty, non-zero init
- [x] Prefix sum: N+1 output verification, single, empty
- [x] Permutation round-trip verification for `sequence_map_inverse`
- [x] Full sequence public API coverage: modify, gen, uniform_gen, scans
(inclusive, exclusive, prefix sum), map_inverse, make_index_sequence,
size/sum/product, push/pop, reverse, extract, merge, arithmetic
operators, equality, transform
- [x] Portable `__type_pack_element` fallback tested implicitly (same
`at_index_t` interface)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2026-03-11 14:24:54 -06:00
JP-Fernando
2c850cd693 [CK] Unify the grouped convolution gridwise Run() functions (#4421)
## Motivation

There are currently three different grouped convolution related Run()
function overloads that exist in `gridwise_gemm_wmma_cshuffle_v3.hpp`.
These are used for the different types of grouped convolution: Forward,
Backward weights, and Backward data.
The functions are very similar and should be unified to a single `Run()`
function for all types of grouped convolution.

## Technical Details

The three old `Run<>()` functions were replaced with a single unified
function.
The new `Run<>()` function is run from device implementations:
  
-  DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3
  
-  DeviceGroupedConvBwdDataMultipleD_Wmma_CShuffleV3
  
-  DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3
  
-  DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3
  
-  DeviceGroupedConvBwdWeight_Wmma_CShuffleV3

The DeviceGroupedConvFwdMultipleD_Wmma_CShuffle_V3_Large_Tensor
implementation uses a different `Run<>()` overload and was therefore not
modified.

## Test Plan

Run the following grouped convolution tests on `gfx1201`, as this
architecture is WMMA-capable:

- `test_grouped_convnd_fwd`

- `test_grouped_convnd_bwd_weight`

- `test_grouped_convnd_bwd_data`

Compilation and testing were also executed on `gfx1100` to avoid CI
problems.

## Test Result

First part (unification of `Run<>()` function): All tests successful.

Second part (integration of single `Run<>()` function as a direct call):
All tests successful.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
2026-03-11 17:38:55 +01:00
JP-Fernando
5d4107862b [CK] Add BF16^3 support to grouped conv bwd weight: bilinear and scale (#4591)
## Motivation

Until now, XDL grouped conv bwd weight for bilinear and scale only
supported bf16f32bf16.
Therefore, bf16bf16bf16 support should be added.

## Technical Details

Instances were added to the relevant files in
`library/include/ck/library/tensor_operation_instance/gpu/grouped_conv_bwd_weight/`
folder. In addition, `add()` functions were included in new files in
`library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/xdl/`
and
`library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_scale/xdl/`
folders. The new .cpp files were also included in the `CMakeFiles.txt`
files of both folders.

## Test Plan

Execute `grouped_convnd_bwd_weight` tests to check execution on
different architectures.
The tests for bilinear and scale already include the tuple
`std::tuple<ck::half_t, ck::half_t, ck::half_t, ck::Number<3>>`, so in
principle, there is nothing to modify in the tests themselves.

## Test Result

`gfx1201`: Tests passed.
`gfx1100`: Tests passed.
`gfx90a`: Tests passed.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
2026-03-11 13:05:44 +01:00
Anton Gorenko
25d9fdfc16 [CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368)
## Motivation

Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline

## Technical Details

The microscaling is used when quant scale mode is
`BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are
fp8/bf8/fp4.

Supported features:
* only "qr" pipeline is implemented
* hdim 128 and 256 (smaller hdim are not possible due to restrictions of
"qr" pipeline, but they can be computed using instances with padding)
 * both 32x32x64 and 16x16x128 scale MFMAs are supported
 * Q and K scales are applied in hdim, V scales - in seqlen dimension
 * column-major V only
 * batch and group mode
 * bias, Alibi (tested but no instances by default, just like fp8)
 * masking etc.

Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008

## Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

## Test Result

The tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-11 09:59:50 +00:00
Thrupti Raj Lakshmana Gowda
791ebfb517 [CK TILE ENGINE] Add grouped_gemm operator to Tile Engine (gfx942/gfx950) (#4996)
## Motivation

The grouped_gemm CK Tile kernel exists (e.g.,
`example/17_grouped_gemm/`) but has no Tile Engine wrapper. Grouped GEMM
handles multiple independent GEMM problems with varying M/N/K dimensions
in a single kernel launch. This PR adds the Tile Engine infrastructure
for automated kernel generation, benchmarking, and profiling of grouped
GEMM kernels.

Jira: AICK-809

## Technical Details

- Created Tile Engine wrapper under `tile_engine/ops/gemm/grouped_gemm/`
following the `gemm_universal` template
- Files added: `CMakeLists.txt`, `grouped_gemm_common.hpp`,
`grouped_gemm_benchmark.hpp`, `grouped_gemm_profiler.hpp`,
`grouped_gemm_benchmark.py`, `grouped_gemm_benchmark_single.cpp`,
`grouped_gemm_instance_builder.py`, `configs/`
- Supported datatypes: fp16, fp8, bf16, bf8
- Supported layouts: rcr, rrr, ccr, crr
- Target GPUs: gfx942, gfx950
- CK Tile kernel: `ck_tile::GroupedGemmKernel` from
`include/ck_tile/ops/gemm/kernel/grouped_gemm_kernel.hpp`
- Instance builder extends `GemmKernelBuilder` base class
- Registered in `tile_engine/ops/gemm/CMakeLists.txt`
- Updated Jenkinsfile to build and benchmark grouped_gemm targets in CI
- Benchmark infrastructure includes JSON output, CSV export, and
verification support

## Test Plan

- CMake configure succeeds for grouped_gemm targets
- Kernel instance builder generates valid kernel headers for all
(datatype, layout) combinations
- At least one kernel binary compiles and runs per datatype/layout
combination
- Correctness passes with `--verify 1` on gfx942/gfx950

## Test Result



## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-03-10 18:58:37 -05:00
John Shumway
2b68a9baf5 [CK_BUILDER] Add DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 to CK Builder (#5284)
Add factory, InstanceTraits, and conv traits support for the WMMA V3
forward convolution kernel, enabling the CK Builder to generate and
dispatch this kernel variant used by MIOpen on gfx11/gfx12 GPUs.

## Motivation

As reported in issue #4944, MIOpen includes WMMA V3 forward convolution
kernels, so this PR adds support for those kernels similarly to other
supported kernels.

## Technical Details

This follows the same implementation as the other kernels. I added some
support for reflection, but I left a few todos since we need to
generalize our convolution traits to generalize across WMMA/MFMA and
CK/CKTile.

## Test Plan

Added faster tests to `ninja smoke-builder` that check the
instance-traits logic, and I added longer tests that instantiate
kernels, following the existing pattern in other kernals.

## Test Result

I tested all code with `ninja check-builder` on a gfx1101 build and ran
on gfx1101.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 16:41:51 -07:00
Brock Hargreaves
83bbb96890 [CK] Fix warp tile combination selection in absence of a GPU (#5213)
## Motivation

The `get_gpu_name_by_id()` function in
`gemm_streamk_validation_utils.py` relies on `rocminfo` to detect the
GPU architecture at runtime. However, __`rocminfo` fails in CI/build
environments__ where:

- No physical GPU is present
- ROCm tools are not installed
- The build is running in a container without GPU access

In any of these environments, the problem manifests itself in incorrect
kernel validation and will generate template instantiations that do not
exist:

```
[composable_kernel] FAILED: test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o 
[composable_kernel] /__w/TheRock/TheRock/build/core/clr/dist/lib/llvm/bin/clang++ -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_ENABLE_TF32 -DCK_TILE_USE_WMMA=0 -DCK_TIME_KERNEL=1 -DCK_USE_FNUZ_FP8 -DCK_USE_GFX94 -DCK_USE_XDL -DDPP_KERNELS -DGEMM_SINGLE_INSTANCE_HPP=\"/__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_streamk_single_fp16_rcr_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.hpp\" -DGEMM_TEST_PARAMS_HPP=\"/__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/test_params.hpp\" -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/profiler/include -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/library/include -I/__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include -I/__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/include -I/__w/TheRock/TheRock/build/profiler/rocprofiler-sdk/stage/include -I/__w/TheRock/TheRock/build/profiler/roctracer/stage/include -I/__w/TheRock/TheRock/build/base/half/stage/include -I/__w/TheRock/TheRock/build/third-party/sysdeps/linux/libdrm/build/stage/lib/rocm_sysdeps/include -isystem /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx942   -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -fno-offload-uniform-block -mllvm --lsr-drop-solution=1 -mllvm -enable-post-misched=0 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -Wno-undefined-func-template -Wno-float-equal --offload-compress -include /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_streamk_single_fp16_rcr_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.hpp -MD -MT test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o -MF test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o.d -o test/ck_tile/gemm_streamk_tile_engine/CMakeFiles/test_gemm_streamk_tile_engine_fp16_rcr_streamk_atomic_smoke_tests_config_fp16_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.dir/test_gemm_streamk_simple.cpp.o -x hip -c /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/gemm_streamk_tile_engine/test_gemm_streamk_simple.cpp
[composable_kernel] In file included from <built-in>:2:
[composable_kernel] In file included from /__w/TheRock/TheRock/build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_streamk_single_fp16_rcr_compv3_cshuffle_intrawave_atomic_False_False_False_False_256x256x32_2x2x1_16x16x8.hpp:9:
[composable_kernel] In file included from /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm.hpp:23:
[composable_kernel] In file included from /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm/block/block_gemm_asmem_bsmem_creg_v1.hpp:7:
[composable_kernel] In file included from /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm/block/block_gemm_asmem_bsmem_creg_v1_default_policy.hpp:8:
[composable_kernel] /__w/TheRock/TheRock/rocm-libraries/projects/composablekernel/include/ck_tile/ops/gemm/warp/warp_gemm_dispatcher.hpp:185:1: error: implicit instantiation of undefined template 'ck_tile::impl::warp_gemm_dispatcher::Dispatcher<_Float16, _Float16, float, 16, 16, 8, false, false, false, ck_tile::WGAttrNumAccessEnum::Single, ck_tile::WGAttrNumAccessEnum::Single>'
```

## Technical Details

### Changes Made:

#### 1. __gemm_streamk_validation_utils.py__

- Added module-level storage: `_configured_gpu_targets`

- Added `set_gpu_targets(targets: List[str])` to configure fallback GPU
targets

- Added `get_configured_gpu_targets() -> List[str]` to retrieve
configured targets

- Enhanced `get_gpu_name_by_id()` to:

  - First try `rocminfo` (existing behavior)
  - If `rocminfo` fails, fall back to first configured GPU target
  - Extract base gfx name (e.g., "gfx90a" from "gfx90a:xnack+")
  - Log debug messages when using fallback

#### 2. __gemm_streamk_instance_builder.py__

- Added `--gpu_targets` command-line argument
- Automatically calls `set_gpu_targets()` when `--gpu_targets` is
provided
- Parses semicolon-separated GPU target list from CMake

#### 3. __test/ck_tile/gemm_streamk_tile_engine/CMakeLists.txt__

- Modified both `--list_kernels` and `--gen_single` invocations to pass
`--gpu_targets "${SUPPORTED_GPU_TARGETS}"`
- GPU targets are now automatically wired from CMake to Python scripts

### How It Works:

1. __CMake Configuration__: `SUPPORTED_GPU_TARGETS` is determined from
`GPU_TARGETS` or defaults
2. __CMake → Python__: CMake passes targets via `--gpu_targets` argument
to Python scripts
3. __Python Configuration__: Scripts call `set_gpu_targets()` to
configure the fallback
4. __Fallback Mechanism__: When `rocminfo` fails, `get_gpu_name_by_id()`
uses the first configured target
5. __Target Parsing__: Extracts clean gfx name (e.g., "gfx90a" from
"gfx90a:xnack+")


## Test Plan

Confirm that only the appropriate kernels are selected and that CI
passes.

## Test Result

1. Waiting on CI
2. Compilation succeeded locally and the kernel list does not contain
the 16x16x8 kernel for gfx942 anymore:

```
(.venv) bhargrea@ctr-cx66-mi300x-02:~/github/TheRock$ cat build/ml-libs/composable_kernel/build/test/ck_tile/gemm_streamk_tile_engine/fp16/rcr/streamk_atomic_smoke_tests_config_fp16/gemm_kernel_list.txt
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_16x16x16|256x256x32_2x2x1_16x16x16|compv3_cshuffle_intrawave_atomic_False_False_False_True
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_16x16x16|256x256x32_2x2x1_16x16x16|compv3_cshuffle_intrawave_atomic_False_False_False_False
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_16x16x32|256x256x32_2x2x1_16x16x32|compv3_cshuffle_intrawave_atomic_False_False_False_True
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_16x16x32|256x256x32_2x2x1_16x16x32|compv3_cshuffle_intrawave_atomic_False_False_False_False
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_32x32x8|256x256x32_2x2x1_32x32x8|compv3_cshuffle_intrawave_atomic_False_False_False_True
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_32x32x8|256x256x32_2x2x1_32x32x8|compv3_cshuffle_intrawave_atomic_False_False_False_False
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_True_256x256x32_2x2x1_32x32x16|256x256x32_2x2x1_32x32x16|compv3_cshuffle_intrawave_atomic_False_False_False_True
gemm_fp16_rcr_compv3_cshuffle_intrawave_Atomic_False_False_False_False_256x256x32_2x2x1_32x32x16|256x256x32_2x2x1_32x32x16|compv3_cshuffle_intrawave_atomic_False_False_False_False
```

## Submission Checklist

- [ x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-10 17:11:56 -06:00
Sami Remes
8b46a8d997 [CK_TILE] MX GEMM non-preshuffled RCR layout (#4594)
## Motivation

Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled
layouts using async pipeline.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2026-03-10 20:12:05 +00:00
Max Podkorytov
90a7253d86 [CK] Precompute SpaceFillingCurve indices to reduce compile time by 31% (#5041)
## Summary

Optimize `SpaceFillingCurve` in CK to reduce compile time by
precomputing all index values into a static constexpr lookup table.

### Problem
- `GetIndex<N>` was instantiated separately for every index value (0 to
NumAccesses-1)
- Each instantiation triggered nested `static_for` loops with O(N²)
template depth
- This caused **34,000+ template instantiations** taking **69 seconds**
in frontend

### Solution
- Add `IndexLookupTable<NumAccesses, nDim>` to store all precomputed
indices
- Add `compute_single_index()` helper using O(N) `static_for` loops
- Add `compute_all_indices()` to build entire table in one constexpr
evaluation
- `GetIndex<N>` becomes simple array lookup: `return index_table[N]`

### Results (conv2d_fwd_xdl_nhwc_kyxc_nhwk_f16_instance.cpp)

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Total compile time | 120.4s | 83.6s | **-31%** |
| Frontend time | 88.7s | 52.6s | **-41%** |
| GetIndex instantiations | 34,176 | 384 | **-99%** |
| GetIndex time | 69.0s | 0.11s | **-99.8%** |
| SpaceFillingCurve time | 75.7s | 4.3s | **-94%** |

## Test plan
- [x] Builds successfully with `-Werror -Weverything`
- [ ] Run existing unit tests
- [ ] Verify numerical correctness on sample kernels

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
2026-03-10 12:40:08 -07:00
Brock Hargreaves
37f3c99ea0 [CK] Streamk tile engine test not setting a reasonable CU_COUNT default when the query fails (#5165)
## Motivation

The following error was coming up when compiling on Windows when the
generate_configs.py file tries to query the GPU for the number of CU's:

```
[composable_kernel configure] -- Generating Stream-K test config files for fp16
[composable_kernel configure] Traceback (most recent call last):
[composable_kernel configure]   File "E:\TheRock\rocm-libraries\projects\composablekernel\test\ck_tile\gemm_streamk_tile_engine\generate_configs.py", line 277, in <module>
[composable_kernel configure]     main()
[composable_kernel configure]     ~~~~^^
[composable_kernel configure]   File "E:\TheRock\rocm-libraries\projects\composablekernel\test\ck_tile\gemm_streamk_tile_engine\generate_configs.py", line 271, in main
[composable_kernel configure]     cu_count, configs_dir_path, tile_sizes, datatype = get_args()
[composable_kernel configure]                                                        ~~~~~~~~^^
[composable_kernel configure]   File "E:\TheRock\rocm-libraries\projects\composablekernel\test\ck_tile\gemm_streamk_tile_engine\generate_configs.py", line 267, in get_args
[composable_kernel configure]     return (int(args.cu_count), args.configs_dir_path, args.tiles, args.datatype)
[composable_kernel configure]             ~~~^^^^^^^^^^^^^^^
[composable_kernel configure] ValueError: invalid literal for int() with base 10: 'Exit code 0xc0000135\n'
[composable_kernel configure] CMake Error at test/ck_tile/gemm_streamk_tile_engine/generate_configs.cmake:98 (message):
[composable_kernel configure]   Eror occured during execution of
[composable_kernel configure]   E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/gemm_streamk_tile_engine/generate_configs.py
[composable_kernel configure] Call Stack (most recent call first):
[composable_kernel configure]   test/ck_tile/gemm_streamk_tile_engine/CMakeLists.txt:301 (generate_test_configs)
[composable_kernel configure]
[composable_kernel configure]
[composable_kernel configure] -- Configuring incomplete, errors occurred!
[composable_kernel configure FAILED WITH CODE 1 in 41 seconds]
ninja: build stopped: subcommand failed.
```

## Technical Details

There was one major problem in the following code and two changes were
made:

```
        execute_process(
            COMMAND ${CPP_EXE_PATH}
            OUTPUT_STRIP_TRAILING_WHITESPACE
            ERROR_VARIABLE standard_error
            RESULT_VARIABLE queried_cu_count
        )

        if (standard_error)
            message(STATUS "Error information from attempting to query HIP device and properties:\n"
                            "${standard_error}")
        endif()
```



1. RESULT_VARIABLE does not capture the IO output of the executable, but
rather the exit code. You can see from the error output here that it was
trying to cast "Exit code 0xc0000135\n" to an integer. I fixed this by
changing RESULT_VARIABLE to OUTPUT_VARIABLE.

```
[composable_kernel configure] ValueError: invalid literal for int() with base 10: 'Exit code 0xc0000135\n'
```

Note that this also gives us the reason that the query failed: Exit code
0xc0000135, which needs to be addressed in a separate issue: "Exit code
0xc0000135, also seen as -1073741515, is a Windows error indicating that
an application failed to start because a required Dynamic Link Library
(DLL) file or a system component like the .NET Framework is missing or
corrupted"

It's likely the executable that is created from this code can't find the
hip dll, or something similar:

```
        set(CPP_FILE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/cu_count.cpp)
        set(CPP_EXE_PATH ${CMAKE_CURRENT_BINARY_DIR}/cu_count)

        execute_process(
            COMMAND ${CMAKE_HIP_COMPILER} -x hip ${CPP_FILE_PATH} -o ${CPP_EXE_PATH}
            RESULT_VARIABLE compile_result
        )
```

2. For clarity and consistency purposes, I changed the check afterwards
to explicitly look for a non-zero exit code. This matches previous
checks in the cmake file. I also added improved error checking when the
query for the cu count fails.

## Test Plan

Ensure it compiles locally and existing CI isn't impacted.

## Test Result

Waiting on CI.

## Submission Checklist

- [ x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Emily Martins <emily.martins@amd.com>
2026-03-10 10:52:18 -07:00
Hosang
559ad6f0b1 [CK_TILE] Update gfx11 FMHA forward kernel configs (#5088)
## Motivation
Tune gfx11 FMHA codegen to recover performance for mainly PSSK (padded
seqlen_q/k) cases.
This tuning is based on heuristic search and improves performance in
most tested shapes.
Performance should be evaluated on top of
[`ROCm/rocm-libraries#5018`](https://github.com/ROCm/rocm-libraries/pull/5018)
(required baseline).

## Technical Details

  - Updated gfx11 codegen heuristic choices for tile size and occupancy.
   - Updated gfx11 pipeline selection:
- Disabled the `npad` (`f,f,f,f`) qr entry because it was consistently
slower than the `pssk` (`t,t,f,f`) path, and kept `pssk` enabled so npad
cases are dispatched to the faster kernel path.`
- Kept gfx12 unchanged: with PSSK support from
[`ROCm/rocm-libraries#4957`](https://github.com/ROCm/rocm-libraries/pull/4957),
existing gfx12 config is already sufficient.
  - Tuning rationale:
    - In some cases, higher `kBlockPerCu` lowers register pressure.
- On RDNA, this generally aligns with better performance when
`waves_per_eu >= 6`.

## Test Plan
- test_ck_tile_fmha
- tile_example_fmha_fwd: tested this on gfx1100 and gfx1151
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=24
-d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

## Test Result
- TFLOPs by sequence length target: `gfx1100` layout: `bhsd`
- mode: batch / VGPR usage: 225 vs 214

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 74.10 | 71.97 | 0.97x
4096 | 66.26 | 77.79 | 1.17x
8192 | 68.18 | 75.88 | 1.11x
12288 | 68.47 | 80.44 | 1.17x
16384 | 59.54 | 79.66 | 1.34x
20480 | 55.78 | 77.91 | 1.40x
24576 | 55.08 | 77.47 | 1.41x
27280 | 47.45 | 77.16 | 1.63x
- mode: group / VGPR usage: 256 vs 214

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 71.47 | 70.6 | 0.99x
4096 | 64.74 | 77.06 | 1.19x
8192 | 64.68 | 75.47 | 1.17x
12288 | 66.43 | 79.95 | 1.20x
16384 | 56.02 | 79.73 | 1.42x
20480 | 50.21 | 78.15 | 1.56x
24576 | 47.29 | 77.53 | 1.64x
27280 | 46.13 | 77.04 | 1.67x

- TFLOPs by sequence length target: `gfx1151` layout: `bshd`
- mode: batch / VGPR usage: 225 vs 223

Batch | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 26.85 | 29.17 | 1.09x
4096 | 24.75 | 26.01 | 1.05x
8192 | 25.24 | 25.50 | 1.01x
12288 | 25.18 | 25.00 | 0.99x
16384 | 24.79 | 25.91 | 1.05x
20480 | 25.56 | 25.24 | 0.99x
24576 | 25.13 | 26.20 | 1.04x
27280 | 10.78 | 26.35 | 2.44x
- mode: group / VGPR usage: 256 vs 229

Batch | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 27.44 | 26.71 | 0.97x
4096 | 21.89 | 23.09 | 1.05x
8192 | 22.85 | 24.49 | 1.07x
12288 | 24.33 | 24.42 | 1.00x
16384 | 20.05 | 24.98 | 1.24x
20480 | 14.70 | 25.15 | 1.71x
24576 | 11.30 | 26.31 | 2.33x
27280 | 10.10 | 26.32 | 2.61x


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-10 09:46:41 -07:00
Bartłomiej Kocot
c749ef9da3 [CK][CK Tile] Add grouped conv backward weight tile test and fix tr load in BASE_V1 pipeline (#5115)
## Motivation

Test grouped conv backward weight from ck tile and fix incorrect values.

## Technical Details

- Add test for CI
- Add daily tests
- Fix transpose load in BASE_V1 pipeline

## Test Plan

test_grouped_convnd_backward_weight_tile

## Test Result

in progress

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-783
2026-03-10 03:03:04 +00:00
John Shumway
751e29ccb6 [CK_BUILDER] Clean up ConvDescription output formatting (#5085)
The `ConvDescription::getDetailedDescription()` output had several
issues that made it harder to read and potentially misleading:

1. **Bug fix**: The LDS padding field was incorrectly displaying
`dst_scalar_per_vector_k1` instead of the actual `lds_padding` value
2. **Noise reduction**: Optional parameters that weren't set were
printing unhelpful messages like "Struct does not contain optional
gemm_padding argument" — these add clutter without providing value to
the reader
3. **Formatting inconsistencies**: Trailing spaces after colons (e.g.,
`"Warp Gemm parameters: "`) and a stray trailing `×` in tile dimensions
4. **Missing thread cluster lengths**: The threads per axis are not
shown.

**Changes**:

- **Fixed the LDS padding bug** by using
`traits_.a_tile_transfer.transfer_params.lds_padding` and
`traits_.b_tile_transfer.transfer_params.lds_padding` instead of
duplicating `dst_scalar_per_vector_k1`
- **Simplified optional parameter handling**: Changed from printing
"Struct does not contain..." messages to simply omitting absent optional
values. Also switched from `.value_or()` to direct dereference (`*`)
since we're already inside an `if` check
- **Cleaned up formatting**: Removed trailing spaces after colons and
the extra `×` at the end of tile dimension lists
- **Added missing thread cluster lengths**: Added X×Y×Z" display for
both A and B tile transfer sections.
- **Fixed typo**: "Do Padd Gemm" → "Do Pad Gemm"
- **Fixed typo**: "scr" → "src"
- **Fixed typo**: "tensros" → "tensors"

- `ninja smoke-builder` ✓
- `ninja check-builder` ✓

The test file updates reflect the corrected expected output, which now
shows the actual `lds_padding` values (0 or 1), shows thread cluster
lenths, and omits the verbose "Struct does not contain..." lines.

**Note**: This PR follows PR #5083.
2026-03-09 22:36:30 +00:00
kensclin
362bfd72e8 [CK] Fix the issue of the aiter to call eightwarps pipeline. (#5218)
## Motivation

Fix the failure of the aiter to call eightwarp.
Changed Async to the name eightwarps.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

Pass

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-09 11:12:13 -07:00
rocking
baa73e1515 [CK_TILE] Fix FMHA async pipeline LDS sync issue (#4742)
## Motivation

Fix FMHA forward async pipeline
(`block_fmha_pipeline_qr_ks_vs_async.hpp`) sync issue.
Some attention test cases intermittently fail due to a race condition
where the V tile store to LDS overwrites K tile data that is still being
read by other threads during the tail `gemm_0` operation.

## Technical Details

In the `BlockFmhaPipelineQRKSVSAsync` pipeline, K and V tiles share the
same LDS memory through a rotation schedule (`LdsSeq`).
After the tail `gemm_0` (line 458), some fast threads may proceed to
store V to LDS (line 617) before slow threads finish reading K data from
the same LDS buffer.

The fix adds an `s_barrier` synchronization after the tail `gemm_0` when
K's last sub-tile and V's first sub-tile use the same LDS buffer (i.e.,
`LdsSeq[k0_loops - 1] == LdsSeq[k0_loops]`):

`if constexpr(LdsSeq.at(number<k0_loops - 1>{}) ==
LdsSeq.at(number<k0_loops>{}))
    __builtin_amdgcn_s_barrier();`

Why `s_barrier` alone is sufficient (no s_waitcnt lgkmcnt(0) needed):
The `gemm_0` MFMA instruction internally waits for its LDS operands
(ds_read) to complete before execution
Therefore, each thread's ds_read of K data is already complete by the
time gemm_0 finishes
Only cross-thread synchronization (`s_barrier`) is needed to ensure all
threads have finished reading before any thread starts writing V

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-03-09 18:05:36 +00:00
Márton Bidlek
a862155c9e Proof of concept for removing forward declarations (#5135)
## Motivation

Currently, we forward declare CK device operation templates in
CK-Builder's reflection code:

9b168082b7/experimental/builder/include/ck_tile/builder/reflect/instance_traits_device_grouped_conv_bwd_weight_xdl_cshuffle.hpp (L13-L57)
This is mainly required to break a circular dependency in reflection.
The architecture of that is as follows:


        
      
MyDeviceOp implements GetInstanceString(). This is typically defined
directly in the class definition (no forward declaration).

        
      
GetInstanceString() calls instance_string<MyDeviceOp>()


        
      
instance_string<MyDeviceOp>() calls
InstanceTraits<MyDeviceOp>::instance_string()


        
      
InstanceTraits has a specialization for MyDeviceOp which implements
instance_string()


So order for GetInstanceString() to work properly, InstanceTraits must
already be defined. And for InstanceTraits to be defined, the device op
needs to be defined. In order to do that, we are currently using
aforementioned forward declaration.

## Technical Details

C++'s lazy template evaluation is used by calling into an as-of-yet
undefined function static member function of
`InstanceTraits<MyDeviceOp>` in `GetInstanceString()`, and then
specializing `InstanceTraits` only _after that_. The caveat here is that
both the device op itself as well as the instance traits specialization
must be in scope, otherwise there would be an undefined function error.
In practise, we can solve that either by placing the instance traits
directly into the file that defines `MyDeviceOp`, or possibly by using a
`.inc` file to keep the concerns separated.

## Test Plan

The results were verified by running the existing regression tests for
CK Builder

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Márton Bidlek <marton.bidlek@streamhpc.com>
2026-03-09 09:34:18 -07:00
Christopher Millette
8765b5f417 [CK_TILE] Fix MMA layout test to match amdgcn_mma OpFamily parameter (#5222)
## Summary

- PR #4837 added `MmaOpFamily OpFamily_` as a new template parameter to
`amdgcn_mma` and `MmaDefaultSelector`, but the MMA layout test (PR
#4495) was not updated to include it
- Add the missing `OpFamily_` parameter to all three `RegisterMapTraits`
partial specializations (gfx9, gfx11, gfx12) and all
`MmaDefaultSelector` usages
- Fixes build failure: `template argument for non-type template
parameter must be an expression`

## Test plan

- [x] Verified test compiles cleanly with ROCm 7.1.1 clang++ targeting
gfx90a
- [x] `test_amdgcn_mma_layout` gfx90a (MFMA): PASSED
- [x] `test_amdgcn_mma_layout` gfx1201 (WMMA): SKIPPED (no device)
- [x] `test_amdgcn_mma_layout` gfx1100 (WMMA): SKIPPED (no device)
- [x] CI validation on all GPU targets

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 22:17:56 -07:00