3438 Commits

Author SHA1 Message Date
Kiefer van Teutem
2089713f94 [rocm-libraries] ROCm/rocm-libraries#8227 (commit 75c30d5)
[CK TILE] Unification Work – Remove unification Flag structs in favor of new
 WarpGemmParams (#8227)

## Motivation

Recently, the way flags are sent down to the intrinsics was changed in
CK Tile. At the point where the WarpGemm is invoked, an arbitrary number
of template parameters can be passed, and these are passed down all the
way to the lowest level intrinsics wrappers. Here
`WarpGemmParamsParser<>` is used to extract flags for the intrinsics.

In this MR we adapt the unification framework (amdgcn_mma struct and
MmaPipelines) to work in the same way. By doing this, there is no longer
a point in our custom intrinsic Flag structs, so these are removed.

Unrelated, but I also tried removing the MmaPipeline flags because they
aren't used for anything except CTranspose, which is already available.
This also makes test_amdgcn_mma_pipeline completely redundant, so I removed
that as well.
2026-06-26 12:00:58 +00:00
Illia Silin
621697af8c [rocm-libraries] ROCm/rocm-libraries#8723 (commit e2f28c1)
[CK] [Security] pin getopt library to specific commit

## Motivation

Making sure that FetchContent_Declare is pulling third-party code from a
specific commit hash instead of the tip of main branch to avoid
accidentally pulling potentially harmful code.
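
A check for unpinned references could look roughly like the sketch below. It is written in Python for illustration and is not part of the CK build; `is_pinned_commit` is a hypothetical helper, and it assumes that "pinned" means a full 40-character SHA-1 commit hash.

```python
import re

# A full SHA-1 commit hash: exactly 40 hexadecimal characters.
_FULL_SHA1 = re.compile(r"[0-9a-fA-F]{40}")


def is_pinned_commit(git_tag: str) -> bool:
    """Return True when a GIT_TAG value names exactly one commit.

    Branch names such as "main" can move to new commits, and abbreviated
    hashes can become ambiguous, so both are rejected.
    (Hypothetical helper for illustration only.)
    """
    return _FULL_SHA1.fullmatch(git_tag.strip()) is not None
```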

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-25 19:06:13 +00:00
spolifroni-amd
713f1fbf46 [rocm-libraries] ROCm/rocm-libraries#8739 (commit 833c182)
[hipcub, rocthrust, rocprim, ck] updating changelogs for 7.14
 (#8739)

## Motivation

The 7.14 change logs have a different format than the ones for 7.13.

This changes the format and also sets the right component versions.
2026-06-24 18:32:34 +00:00
chris-tsiaousis-hpc
e503e6277a [rocm-libraries] ROCm/rocm-libraries#8762 (commit fe88750)
[CK] Fix flaky test `test_batched_gemm_b_scale_wmma`

Loosen up the tolerance and add better logging in case of failure

TLDR: `profile_batched_gemm_b_scale_impl.hpp` (lines 390-392) uses `rtol
= atol = 1e-2`, while the sibling non-batched B-scale profiler uses
`2e-2` in `profile_gemm_b_scale_impl.hpp` (lines 361-364). `KBatch > 1`
adds an F16 atomic accumulation step, so the batched path is stricter on
the noisier arithmetic path. Loosening up the tolerances should fix the
flaky test. I've also added better logging in case of failures.

## Motivation

`test_batched_gemm_b_scale_wmma` can fail on gfx1201 when `KBatch > 1`
because that path uses split-K F16 atomic accumulation. The existing
batched B-scale tolerance was stricter than the non-batched B-scale
profiler, even though the batched split-K path has additional rounding
noise from native F16 atomics.
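
The extra rounding noise can be illustrated on the host with a small Python sketch. This is not the device code: it only emulates rounding to half precision after every addition, which is an assumption about how repeated F16 accumulation behaves, and the partial sums are made up for the illustration.

```python
import struct


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate_f16(partials):
    """Add partial sums one by one, rounding to F16 after every addition."""
    acc = 0.0
    for p in partials:
        acc = to_f16(acc + to_f16(p))
    return acc


def accumulate_then_round(partials):
    """Add partial sums in double precision and round to F16 only once."""
    return to_f16(sum(partials))


# Illustrative split-K partial results (not taken from the real test).
partials = [2048.0, 0.5, 0.5, 0.5, 0.5]
stepwise = accumulate_f16(partials)                # small terms are rounded away
single_rounding = accumulate_then_round(partials)  # small terms survive
```

Near 2048 the spacing between neighbouring F16 values is 2, so each `0.5` contribution is lost when the sum is rounded after every step, while a single final rounding keeps their combined effect.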

This PR aligns the batched tolerance with the comparable non-batched
path and improves failure diagnostics so future mismatches identify the
exact shape, KBatch value, instance, strides, and operator.

## Technical Details

Updated
[projects/composablekernel/profiler/include/profiler/profile_batched_gemm_b_scale_impl.hpp](projects/composablekernel/profiler/include/profiler/profile_batched_gemm_b_scale_impl.hpp)
to:

- Change non-FP8 verification tolerance from `1e-2` to `2e-2`.
- Match the tolerance already used by the non-batched B-scale profiler.
- Build a detailed failure message with:
  - instance index
  - `kbatch`
  - `M`, `N`, `K`, and `BatchSize`
  - `StrideA`, `StrideB`, and `StrideC`
  - selected operator name
- Move operator name construction earlier so it is available for
diagnostics.
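
As a rough illustration of what the tolerance change means, the sketch below uses a combined check of the form `|out - ref| <= atol + rtol * |ref|`. That formula and both helper names are assumptions made for this example; the profiler's actual comparison may differ in detail. The message fields mirror the list above.

```python
def within_tolerance(out: float, ref: float, rtol: float, atol: float) -> bool:
    """Combined absolute/relative check (assumed form, for illustration)."""
    return abs(out - ref) <= atol + rtol * abs(ref)


def failure_message(instance, kbatch, m, n, k, batch_size, strides, op_name):
    """Build a diagnostic string carrying the fields listed above."""
    stride_a, stride_b, stride_c = strides
    return (
        f"instance {instance} failed: kbatch={kbatch} "
        f"M={m} N={n} K={k} BatchSize={batch_size} "
        f"StrideA={stride_a} StrideB={stride_b} StrideC={stride_c} "
        f"op={op_name}"
    )


# A result that is 1.5 away from a reference of 100.0 (made-up numbers).
old_ok = within_tolerance(101.5, 100.0, rtol=1e-2, atol=1e-2)
new_ok = within_tolerance(101.5, 100.0, rtol=2e-2, atol=2e-2)
message = failure_message(3, 2, 512, 256, 1024, 4, (1024, 1024, 256), "example_op")
```

With these numbers the old tolerance rejects the value (allowed error 1.01) and the new one accepts it (allowed error 2.02).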

## Test Plan

Configured the CK gfx1201 build with device instances enabled and XDL
disabled to isolate WMMA coverage.

Built and ran the focused batched GEMM B-scale WMMA test target:

```bash
ninja -C projects/composablekernel/build-gfx950-gfx1201 test_batched_gemm_b_scale_wmma
```

Ran the focused `MidLargeM` case across WMMA instances and then ran the
full WMMA-only gtest target.

## Test Result

The WMMA-only gfx1201 validation passed.

- `MidLargeM` passed across 21 WMMA instances.
- Full `test_batched_gemm_b_scale_wmma` passed all 3 gtests.
- No incorrect-result lines were reported.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-06-24 16:41:50 +00:00
Kiefer van Teutem
137f2a9a10 [rocm-libraries] ROCm/rocm-libraries#7407 (commit 0b79e05)
[CK TILE] Initial integration of MFMA / WMMA unification
 framework into CK Tile (#7407) (locked behind flag)

Note: Everything works but this is still a draft MR because I want to do
some more cleanup and maybe do some testing for MX fp6. Also please
don't trigger copilot, I will do this once I feel it is clean enough,
otherwise I'll get a bunch of comments about stuff I already know.

## Motivation
The point of this MR is to finally use our unification MmaPipelines to
replace the existing WarpGemms in CK tile and make sure everything
works. I focused on gfx908 and gfx950 for now, dense and scale
intrinsics, fp16, fp8, and fp4. I managed to get CK tests / examples
working for all of these scenarios, so the basic implementation should
be correct. I expect some more tweaks will be required to get full
support, some of which I already anticipated in the section "New
issues".

## Big switch: USE_NEW_UNIFIED_FRAMEWORK
When USE_NEW_UNIFIED_FRAMEWORK is 1, we replace all WarpGemms with
MmaPipelines from the new unified framework. This means
WarpGemmDispatcher will use the UnificationDispatcher instead of the
regular Dispatcher. Furthermore, named WarpGemms like
WarpGemmMfmaF32F32F32M16N16K4 will also get rerouted to the
UnificationDispatcher. The latter is necessary because some pipelines
bypass the WarpGemmDispatcher in favor of directly using named
WarpGemms.

For now the switch is turned on for easier testing, so don't expect the
CI to pass. When off, this MR should not affect any of the CK tile tests
at all so I *would* expect the CI to pass.

## Simplification of MmaPipelineBase
I found that the structure of MmaPipelineBase was a bit complex and I
was able to reduce it a lot. The only thing an Mma Pipeline does
(currently) is provide a wrapper around amdgcn structs that allows k
iteration and sparse compression. We don't allow M and N composition for
now for simplicity and since this is not expected from WarpGemms in CK
Tile currently.
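
The k iteration part can be pictured with the following Python sketch. It is a conceptual model only: `block_mma` stands in for one fixed-size intrinsic call and `pipeline_mma` for the wrapper, both names are hypothetical, and sparse compression is not modelled.

```python
def block_mma(a, b, c, k0, k_block):
    """Stand-in for one fixed-size intrinsic.

    Accumulates c += a[:, k0:k0+k_block] @ b[k0:k0+k_block, :].
    """
    for i in range(len(a)):
        for j in range(len(b[0])):
            for k in range(k0, k0 + k_block):
                c[i][j] += a[i][k] * b[k][j]


def pipeline_mma(a, b, k_block):
    """Cover the full K extent by repeating the fixed-size block along K."""
    k_total = len(b)
    if k_total % k_block != 0:
        raise ValueError("K must be a multiple of the intrinsic K size")
    c = [[0] * len(b[0]) for _ in range(len(a))]
    for k0 in range(0, k_total, k_block):
        block_mma(a, b, c, k0, k_block)
    return c


a = [[1, 2, 3, 4], [5, 6, 7, 8]]       # M=2, K=4
b = [[1, 0], [0, 1], [1, 0], [0, 1]]   # K=4, N=2
c = pipeline_mma(a, b, k_block=2)      # two calls of a K=2 "intrinsic"
```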

## Re-interpretation of tile distribution encodings for packed datatypes
Tile distributions for packed types are expected to describe
mathematical elements, not datatype elements! This distinction is why
the gfx950 fp4 CK_tile tests were not working. Updated the
interpretation in amdgcn_mma, tile distribution calculator, and layout
test, along with comments. Tested on all architectures.
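
The distinction can be made concrete with a small Python sketch. `storage_elements` is a hypothetical helper; the only assumption is that a packed type stores `packed_size` mathematical values per datatype element, for example two 4-bit values per byte for packed fp4.

```python
def storage_elements(logical_elements: int, packed_size: int) -> int:
    """Number of datatype (storage) elements holding the given number of
    mathematical (logical) elements, for a type that packs `packed_size`
    logical values into one storage element."""
    if packed_size <= 0:
        raise ValueError("packed_size must be positive")
    if logical_elements % packed_size != 0:
        raise ValueError("logical element count must be a multiple of packed_size")
    return logical_elements // packed_size


# 32 mathematical fp4 values occupy 16 packed storage elements,
# while 32 fp16 values (packed_size 1) occupy 32.
fp4_storage = storage_elements(32, packed_size=2)
fp16_storage = storage_elements(32, packed_size=1)
```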

## getCMakeCompilerTarget() for configuration time target architecture
This is a workaround because there are a lot of cases in CK Tile where
the host code inspects Device constructions like WarpGemm, and we need
to get the version that *will* be used on the device. This is a big
kludge and we need to figure out a better solution. Also this util will
always pick the *first* cmakelists target arch, so there will be issues
when compiling for multiple target architectures. Ideally, the host code
should not touch the WarpGemms at all, and there would be no issue. This
has been a point of friction in CK for a long time. We can discuss this
with Chris Millette.

## Tests
I was able to verify that the following CK Tile tests and examples work
with the new unified framework:
- tile_tutorial_mfma_16x16x16 (gfx9, fp16, uses transpose)
- tile_example_gemm_basic (gfx9, fp16)
- test_ck_tile_mx_gemm_async (gfx950, microscaling fp8 and fp4)

Within the tile tutorial I was also able to use
WarpGemmMfmaF16F16F32M16N16K32TransposedCDistribution instead of
WarpGemmMfmaF16F16F32M16N16K16TransposedCDistribution to verify that
basic K iteration also works.

A little while ago I also verified that the performance did not change
in a measurable way. The compile time did not change *much*, but I did see
some swings of up to 20% each way (faster or slower). We will need some
broader and more accurate tests for this going forward.

## Moving forward
To confidently be able to replace the existing Dispatcher and WarpGemm
framework with our own, we need to make sure that all existing tests and
examples work on all platforms. Furthermore, we should pay attention to
performance and compile time of all these tests. Performance should
definitely not change, as all we're doing is refactoring the support
structure around the intrinsics, which should melt away during
compilation.

## New issues
I will make new issues with descriptions for these, but here is a short
(incomplete) list:

- Test RDNA CK Tile pipelines.
- Test Sparse CK Tile pipeline (does not exist but we can make one).
- Remove MmaOp flags from unification framework and update it to work with
the new WarpGemmParamsParser instead.
- Add Swizzle support and test in CK Tile pipelines.
- Test Scale + transpose CK Tile pipelines.
- Coherent strategy for attrnumaccess for dense, scale, default, packed,
wmma, gfx1250, etc. in CK Tile. It's messy now.
- Dispatcher should not be determining scale-ness of intrinsics based on
MNK sizes.
- Try adding back the MN composition in MmaPipelines.
- Why is test_amdgcn_wavewise_mma only compiled for CDNA?
- Investigate NOP and AGPR flags.
- Maybe get rid of WmmaTag in dispatcher.
- Find a coherent strategy for dealing with host vs device compile passes,
and the host sneaking a peek at WarpGemm internals. Related to
getCMakeCompilerTarget().

## TODO before merge
Some changes exist just for ease of testing, and will be reverted before
merging:
- gemm_basic.cpp has a lot of datatypes disabled because otherwise
compile time is huge for testing
- USE_NEW_UNIFIED_FRAMEWORK is set to 1 for easier testing
2026-06-24 13:35:25 +00:00
Illia Silin
bd3713c710 [rocm-libraries] ROCm/rocm-libraries#8716 (commit 8230b20)
[CK] [Security] remove allow-unauthenticated flag from
 dockerfile (#8716)

## Motivation

Dockerfile uses apt-get install --allow-unauthenticated which disables
APT GPG signature verification, allowing package installation without
cryptographic validation. An attacker who can perform a
man-in-the-middle attack on the build network (via corporate proxy, CI
egress, or compromised mirror) can serve trojaned .deb packages such as
libc6, cmake, or git that become embedded in published ROCm container
images distributed to users.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-23 18:12:18 +00:00
Enrico Degregori
55e30feac6 [rocm-libraries] ROCm/rocm-libraries#8637 (commit a1a7f5f)
[CK] Fix compilation

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-20 02:08:58 +00:00
Adel Johar
01bad4c3d9 [rocm-libraries] ROCm/rocm-libraries#8205 (commit f58120c)
[Docs] Standardize precision support reference pages across
 components (#8205)

## Motivation

The goal of this PR is to standardize the precision support reference
page format across all components, while also reducing the maintenance
burden of having to manually update the YAML data file in
https://rocm.docs.amd.com/en/latest/reference/precision-support.html

## Technical Details

- Each component maintains its own YAML file which will be eventually
used in
https://rocm.docs.amd.com/en/latest/reference/precision-support.html
- A new precision support reference page is introduced which will not
override existing data type/precision support content; it will serve as
the overview/summary that will be linked in the ROCm reference page

## Test Plan

- Built locally, viewed each component manually

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-19 15:08:04 +00:00
Bartłomiej Kocot
7c2b979de2 [rocm-libraries] ROCm/rocm-libraries#8573 (commit 04c9f1d)
[CK][CK Tile] Drop profiler for experimental builder codegen
 (#8573)

## Motivation

Switch to dispatcher profiler for ck tile conv.

## Technical Details

- Switch to dispatcher profiler for ck tile conv.
- Drop profiler for experimental codegen
- Minor fixes for bwd data printing
- Minor fixes for 3d conv in dispatcher codegen

## Test Plan

test_grouped_conv*tile

## Test Result

Passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-19 09:38:44 +00:00
Enrico Degregori
2733e75900 [rocm-libraries] ROCm/rocm-libraries#6565 (commit d41715e)
[CK Tile] Async support pipeline V3

## Motivation

Optimize pipeline V3 for gfx950 by enabling buffer load to lds (async
pipeline)

## Technical Details

- Add an `Async` bool to the `Problem` struct to enable the async pipeline
within the existing one
- Add `static_move_ys` to load transpose. This generates the offset in the
assembly instructions, saving registers
- Add `is_valid` to `async_get_vectorized_elements`. It was previously
hard-coded to true; passing it in allows padding to be supported
- Remove unnecessary restrictions on `is_a_load_tr` and `is_b_load_tr`
(wider use of LDS load transpose on gfx950)
- Integrate async support into the existing V3 pipeline (avoids pipeline
duplication)
- Create a policy that supports both the async and default cases. It could
be used by any async pipeline (next steps)
- Define `wg_attr_num_access` separately for A and B. This allows
optimizing the ds_read instruction width for cases where one matrix is
transposed and the other is not. Previously, `ds_read_b64` was used
instead of `ds_read_b128` in such cases
- Add a test for V3 async. It currently only supports cases where A and B
have the same type

## Test Plan

New test `test_ck_tile_gemm_pipeline_compv3_async`

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-19 06:57:14 +00:00
Brock Hargreaves
081fe18c1c [rocm-libraries] ROCm/rocm-libraries#8558 (commit ccfa08b)
[CK][CI] Retry git network ops to survive transient DNS blips
 (#8558)

## Motivation

CI builds intermittently fail on transient git DNS blips (e.g. `Could not
resolve host: github.com`). These surface as an untyped `exit code
1`, which the existing node/transient-fault retry doesn't catch — so a
momentary glitch fails the whole build.

## Technical Details

Added `gitNetRetry(label, body)` (3 attempts, 15s backoff) and wrapped
every github.com-touching git step: ref-repo clone/update, `checkout
scm`, and the hipTensor clone. All are idempotent on retry. Docker pulls
are left to the existing `pullImage()` path.
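
The retry shape described here can be sketched as follows. This is a Python analogue for illustration only, not the Groovy helper itself: `git_net_retry`, its parameters, and the `sleep` hook are assumptions, with the attempt count and backoff taken from the description above.

```python
import time


def git_net_retry(label, body, attempts=3, backoff_s=15, sleep=time.sleep):
    """Run `body` up to `attempts` times, pausing `backoff_s` seconds
    between tries. The last failure is re-raised unchanged."""
    for attempt in range(1, attempts + 1):
        try:
            return body()
        except Exception as err:
            if attempt == attempts:
                raise
            print(f"[{label}] attempt {attempt}/{attempts} failed: {err}; "
                  f"retrying in {backoff_s}s")
            sleep(backoff_s)


# Demo: a step that fails twice with a DNS-style error, then succeeds.
calls = []


def flaky_clone():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("Could not resolve host: github.com")
    return "ok"


delays = []  # record the requested delays instead of really sleeping
result = git_net_retry("ref-repo clone", flaky_clone, sleep=delays.append)
```

Retrying is only safe because the wrapped steps are idempotent, as noted above; a step with side effects that cannot be repeated would need a different approach.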

## Test Plan

- Mapped the failing build's `git remote update` DNS error to a
now-wrapped call.
- Confirmed no existing code retries git host-resolution failures.

## Test Result

Groovy shared-library — not locally executable; needs a pipeline run to
fully validate. Check CI.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 21:18:27 +00:00
Brock Hargreaves
8864dcc3a4 [rocm-libraries] ROCm/rocm-libraries#8560 (commit f8362a1)
[CK][CI] Post failure GitHub status on stage build errors
 (#8560)

## Motivation

Failed CI stages (e.g. Static checks) were left stuck on a `pending`
GitHub status instead of reporting `failure`, so PRs showed an overall
failure with no indication of which check actually failed.

## Technical Details

`buildAndTest` posted `pending`/`success` statuses but its catch only
rethrew, deferring failure reporting to `runOnHealthyNode` — which
deferred right back. Neither posted `failure`. This adds a `failure`
status post for real build errors in `buildAndTest`, while letting
node-reroute signals (`NodeFault`/`TransientFault`) and aborts
(`FlowInterruptedException`) propagate untouched so retries still work.
Since every stage routes through `buildAndTest`, this fixes both the
directly-called `Static checks` stage and the `runOnHealthyNode`-wrapped
per-arch build stages.
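
The control flow can be sketched in Python as below. The exception classes and the `post_status` callback are hypothetical stand-ins for the Groovy types named above; the sketch only shows which errors trigger a `failure` post and which pass through untouched.

```python
class NodeFault(Exception):
    """Stand-in for a node-reroute signal."""


class TransientFault(Exception):
    """Stand-in for a transient-fault signal."""


class FlowInterrupted(Exception):
    """Stand-in for an abort."""


REROUTE_OR_ABORT = (NodeFault, TransientFault, FlowInterrupted)


def build_and_test(stage, post_status):
    post_status("pending")
    try:
        stage()
    except REROUTE_OR_ABORT:
        raise  # let the retry logic see these; do not mark the check failed
    except Exception:
        post_status("failure")  # a real build error: report it, then rethrow
        raise
    post_status("success")


def run_stage(stage):
    """Run one stage and return the list of statuses that were posted."""
    posted = []
    try:
        build_and_test(stage, posted.append)
    except Exception:
        pass  # the demo only inspects the posted statuses
    return posted


def failing_build():
    raise RuntimeError("clang-format violation")


def rerouted_build():
    raise NodeFault("node went offline")


statuses_on_error = run_stage(failing_build)
statuses_on_reroute = run_stage(rerouted_build)
statuses_on_success = run_stage(lambda: None)
```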

## Test Plan

Trigger a stage failure (e.g. introduce a clang-format violation) and
confirm the corresponding GitHub status context transitions `pending` →
`failure` rather than remaining `pending`.

## Test Result

Pending CI run on a branch with a deliberate failure to confirm the
status transition.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-18 21:16:24 +00:00
Brock Hargreaves
bad7870830 [rocm-libraries] ROCm/rocm-libraries#8508 (commit 5cc3bef)
[CK][CI] Make gfx1250 build compile-only

## Motivation

gfx1250 has no CI hardware, so its build piggybacks on gfx90a nodes
where gfx1250 binaries can be compiled but not run. The build currently
fails because post-build runtime tests fire on the gfx90a node. This PR
makes the gfx1250 build compile + install only.

## Technical Details

The post-build test block in `buildAndTest` (`ck.groovy`) keys off the
physical node arch (`gfx90a`), so runtime tests run for gfx1250. Gated
that block off for gfx1250. Body-only change with no signature changes,
so it's backward compatible with the develop-pinned shared library and
doesn't affect other archs.

## Test Plan

Trigger the gfx1250 build with `USE_CURRENT_BRANCH_FOR_CK_GROOVY=true`
and confirm it compiles/installs with no runtime test steps; confirm
gfx90a builds are unchanged.

## Test Result

Check CI.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-18 18:33:59 +00:00
Sami Remes
a3a12b8945 [rocm-libraries] ROCm/rocm-libraries#5813 (commit 18b43cf)
[CK_TILE] Enable full transpose layout support for MX GEMM
 pipeline (#5813)

## Enable full transpose layout support for MX GEMM pipeline (32x32x64 MFMA)

### Summary

This PR enables all four matrix layout combinations (Row/Col, Row/Row,
Col/Col, Col/Row) for the MX GEMM pipeline with `32x32x64` MFMA warp
tiles, using `ds_read_tr` transposed LDS loads on gfx950. Previously,
only the canonical `A=RowMajor, B=ColumnMajor` layout was supported.

### Changes

**Kernel-side transpose support:**

- **`warp_gemm_attribute_mfma.hpp`**: Introduce `kSplitFactor` logic in
`get_warp_dstr_encoding` to split the K-dimension distribution encoding
when `kPerLane` exceeds the `ds_read_tr` subtile minor dimension. This
satisfies the `TransposeTileDistributionTraits` suffix validation
required by `load_tile_transpose`. The distribution encoding now also
receives the `DataType` template parameter to compute the split factor
based on packed element size.

- **`gemm_pipeline_ag_bg_cr_comp_async.hpp`**: Uncomment and enable the
`InputTileDistributionTraits` logic to properly transform LDS load tile
distributions for transposed reads. Add `static_assert`s to catch
misconfigurations where a layout requires transpose loads but the warp
tile size disables them (e.g. `KWarpTile=128` exceeds `ds_read_tr`
limits).

- **`load_tile_transpose.hpp`**: Fix `DataVec` sizing for packed types
(`pk_fp4_t`) — divide `vecLoadSize` by `PackedSize` to prevent buffer
overflow when each physical element contains multiple logical values.

- **`warp_gemm_attribute_mfma_impl.hpp`**: Set `kDefaultScale` to
`0x7F7F7F7F` (unity in e8m0 format) for the unscaled `operator()`
overloads of `WarpGemmAttributeMfmaImpl_f32_32x32x64_f8f6f4`, ensuring
correct behavior with `mfma_scale_f32_32x32x64_f8f6f4`.

- **`warp_gemm.hpp` / `warp_gemm_dispatcher.hpp`**: Add generic
`WarpGemmMfma_f32_32x32x64_f8f6f4<A, B>` alias and dispatcher
specialization to support arbitrary MX data type combinations (fp4, fp6,
fp8) with the 32x32x64 MFMA, consolidating the existing type-specific
aliases.

- **`gemm_pipeline_ag_bg_cr_comp_async_default_policy.hpp`**: Simplify
`wg_attr_num_access` determination — `Double` for fp8, `Single`
otherwise.
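
To see why `0x7F7F7F7F` acts as a unity scale, the Python sketch below decodes e8m0 bytes as `2^(e - 127)`, with `0xFF` treated as NaN. The helper names are hypothetical, and reading the constant as four one-byte scales packed into a 32-bit word is an assumption made for the illustration.

```python
E8M0_BIAS = 127


def e8m0_to_float(byte: int) -> float:
    """Decode one e8m0 scale: an 8-bit biased exponent with no sign and
    no mantissa. 0xFF is reserved for NaN."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("e8m0 value must fit in one byte")
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - E8M0_BIAS)


def unpack_scales(word: int):
    """Split a 32-bit word into four e8m0 scales (assumed packing)."""
    return [e8m0_to_float((word >> (8 * i)) & 0xFF) for i in range(4)]


default_scales = unpack_scales(0x7F7F7F7F)  # every byte is 0x7F = 127
```

Each byte equals the bias, so every decoded scale is `2^0 = 1.0` and the scaled intrinsic behaves like an unscaled one.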

**Reference implementation fix:**

- **`reference_gemm.hpp`**: Fix nibble selection for packed 4-bit types
(`pk_fp4_t`, `pk_int4_t`) in `reference_mx_gemm`, `reference_gemm`, and
`reference_gemm_abquant`. The previous logic used `k % 2` or
`index[K_DIM] & 1` to select which nibble to extract, which assumed K
was always the fast (contiguous) memory dimension. This is only true for
`A=RowMajor` / `B=ColumnMajor`. For other layouts, the fix computes the
flat memory offset via `mDesc.GetOffsetFromMultiIndex(...)` and uses its
parity to correctly select the nibble regardless of layout.
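
The difference between the two selection rules can be reproduced with a small Python sketch. The helper names are hypothetical, and the sketch assumes that the element at an even flat offset sits in the low nibble of its byte.

```python
def pack_nibbles(values):
    """Pack 4-bit values two per byte; even flat offsets go to the low nibble."""
    return bytes(
        (values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
        for i in range(0, len(values), 2)
    )


def flat_offset(index, strides):
    """Flat memory offset, in logical elements, of a multi-index."""
    return sum(i * s for i, s in zip(index, strides))


def load_nibble(packed, offset, use_high):
    """Read the byte holding `offset` and return the selected nibble."""
    byte = packed[offset // 2]
    return (byte >> 4) & 0xF if use_high else byte & 0xF


# B stored row-major as [K, N]: N, not K, is the contiguous dimension.
K, N = 2, 4
values = list(range(K * N))
packed = pack_nibbles(values)
strides = (N, 1)

k, n = 1, 2
offset = flat_offset((k, n), strides)
by_offset = load_nibble(packed, offset, offset & 1)  # parity of the flat offset
by_k = load_nibble(packed, offset, k % 2)            # old rule: parity of k
```

Here the flat offset is 6, so the element lives in the low nibble; selecting by `k % 2` picks the high nibble of the same byte and returns the neighbouring value instead.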

**Test infrastructure:**

- **`test_mx_gemm_config.hpp`**: Add `MxGemmConfig32` base and
`MXfp4_GemmConfig32` / `MXfp8_GemmConfig32` configs for the 32x32x64
warp tile.
- **`test_mx_gemm_fp4.cpp` / `test_mx_gemm_fp8.cpp`**: Add `Config32`
test suites covering all four layout combinations. Restrict `Config16`
(16x16x128) to `A=Row, B=Col` only, since `KWarpTile=128` exceeds
`ds_read_tr` limits.
- **`test_mx_gemm_util.hpp`**: Fix scale tensor layout — scales are
always row-major `[M, K/32]` and column-major `[K/32, N]`, independent
of A/B data layout.
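
The scale layouts described in the last item can be written down as shapes and strides. The Python sketch below is illustrative: `scale_layouts` is a hypothetical helper, and it assumes one scale per block of 32 elements along K, as the `K/32` extents suggest.

```python
SCALE_BLOCK = 32  # elements along K covered by one scale (from K/32 above)


def scale_layouts(m, n, k, scale_block=SCALE_BLOCK):
    """Return shape and strides of the A and B scale tensors.

    A scales are row-major [M, K/32]; B scales are column-major [K/32, N].
    Neither depends on the layout of the A/B data tensors.
    """
    if k % scale_block != 0:
        raise ValueError("K must be a multiple of the scale block size")
    kb = k // scale_block
    a_scale = {"shape": (m, kb), "strides": (kb, 1)}
    b_scale = {"shape": (kb, n), "strides": (1, kb)}
    return a_scale, b_scale


a_scale, b_scale = scale_layouts(m=64, n=128, k=256)
```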

### Test plan

- [x] `test_ck_tile_mx_gemm_fp4` — 5/5 passed (16x16x128 Row/Col +
32x32x64 all 4 layouts)
- [x] `test_ck_tile_mx_gemm_fp8` — 5/5 passed (16x16x128 Row/Col +
32x32x64 all 4 layouts)
- [x] `test_ck_tile_mx_gemm_fp6` — 1/1 passed (16x16x128 Row/Col)
2026-06-18 17:05:09 +00:00
Illia Silin
e2deaaba64 [rocm-libraries] ROCm/rocm-libraries#8591 (commit 5210ae6)
[CK] fix daily hipTensor tests.

## Motivation

Had to change the way hipTensor is cloned to make sure it doesn't erase
CK installation and uses the correct path for the installation. Also
added the "install" target every time we build and test everything, so
we could use CK for testing third-party libs that depend on it.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 14:58:10 +00:00
Enrico Degregori
1762eaeaec [rocm-libraries] ROCm/rocm-libraries#8535 (commit a0f47eb)
[CK Tile] EightWaves pipeline int8 support

## Motivation

The EightWaves pipeline currently supports only FP types.

## Technical Details

 - Enable 16x16x64 int8 instruction for gfx950 in dispatcher
 - Enable int8 in EightWaves pipeline
 - Add tests
 - Fix bug in `warp_gemm_attribute_mfma_impl.hpp`

## Test Plan

Tests have been added for int8 GEMM using EightWaves pipeline

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 12:59:59 +00:00
Ville Pietilä
60b276647b [rocm-libraries] ROCm/rocm-libraries#8157 (commit b0d9d39)
[CK Tile] Rule-based configuration generation in CK
 Dispatcher codegen (#8157)

## Motivation

The CK Tile Dispatcher code generation for CK Tile Profiler relies on
flat JSON files to list the generated configurations. This approach has
the following problems:

- The JSON files are verbose.
- The JSON files easily get out of sync with the CK Builder .config
files from which they were generated.
- The JSON-file-based configuration makes it hard to explicitly list the
rules that govern the instance generation.

## Technical Details

Replaced the JSON files with a rule based configuration. To preserve the
existing functionality, the `profiler` and the `tests` instance sets are
generated directly from the CK Builder config files. The JSON config
files are removed from source control, and the "on-the-fly" generation
guarantees that the Dispatcher codegen uses up to date configurations.

This PR introduces six different rule sets for the CK Tile Dispatcher
code generation:

1. `profiler`: matches with the old JSON set of profiler configurations.
2. `tests`: matches with the old JSON set of tests configurations.
3. `full`: full configuration set created from a rule-based config
selection
4. `full-tests`: a subset of `full` for generating configurations for
convolution integration tests.
5. `tiny`: a subset of `full-tests` to produce the minimal set of
configurations to test the Dispatcher codegen.
6. `default`: the default rules, which corresponds to the existing
heuristic rules for configuration selection. This ensures that ML based
kernel selection doesn't get broken.

The main use of the `full` rule set is to define a reasonable solution
space for the possible implicit GEMM configurations. We start from the
configurations that are allowed by the device architecture. The `full` rule
set defines the relevant tile sizes for each convolution direction. From
the tile size we have a curated mapping to the number of waves over the
different GEMM axes, i.e., we describe how many waves each GEMM
dimension corresponds to. The GEMM-K wave tile dimension can be
computed from the other parameters and does not need to be listed
explicitly.

An orthogonal axis to the tiling strategy is the vectorization strategy.
This is mainly defined by the data type and hardware, as in general we want
to use the maximum possible load widths. The maximum sizes for each
convolution direction variant are defined by the implicit GEMM matrix
dimensions. For cases where we have a low number of channels per
convolution group, we need smaller vector load sizes. These are captured
by the `VecStrategy` enumeration in the codegen rules.

The problem with the rule-based configuration selection is that we
"over-generate" configurations. The old JSON configurations make up
approximately 25% of all configurations that the `full` rule set creates.
The additional configurations are valid, but they may not provide any
performance benefits. Hence, we keep the `profiler` and `tests` rule sets
for now to avoid building an excessive number of configurations by default.
The `full` rule set can be enabled by specifying the CMake
configuration flag `-D DISPATCHER_RULE_SET=full`. By default, the
`tests` rule set is used, i.e., we don't change the existing behaviour.
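
The selection behaviour described here can be sketched in Python. `resolve_rule_set` is a hypothetical helper and not the codegen's actual entry point; the rule set names and the `tests` fallback come from the description above.

```python
RULE_SETS = ("profiler", "tests", "full", "full-tests", "tiny", "default")
FALLBACK_RULE_SET = "tests"  # used when DISPATCHER_RULE_SET is not given


def resolve_rule_set(cmake_value=None):
    """Map the DISPATCHER_RULE_SET value to a known rule set name.

    An unset or empty value falls back to `tests`, which keeps the
    existing behaviour. Unknown names are rejected early.
    """
    if cmake_value is None or cmake_value == "":
        return FALLBACK_RULE_SET
    if cmake_value not in RULE_SETS:
        raise ValueError(
            f"unknown rule set {cmake_value!r}; expected one of {RULE_SETS}"
        )
    return cmake_value


unset_choice = resolve_rule_set()
full_choice = resolve_rule_set("full")
```

Note that the rule set named `default` is distinct from the fallback: it has to be requested explicitly, while an unset flag selects `tests`.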

## Test Plan

Added a new stage in the CI/CD pipeline that ensures the Dispatcher
codegen rules are up to date. Otherwise the functionality is covered by
the existing CI/CD tests. There are no functional changes to the
convolution kernels. Only how the different instances are generated.

## Test Result

If the CK Tile conv instances build without errors, the Dispatcher
codegen is generating valid code. If all tests in CI/CD pipeline are
passing, the Dispatcher codegen generates valid instances.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 01:22:50 +00:00
Aviral Goel
c43b550206 [rocm-libraries] ROCm/rocm-libraries#8202 (commit 0911fa0)
[GFX1250][CK_TILE] Add scale16 (ScaleBlockSize=16) support to
 MX GEMM TDM pipeline (#8202)

Enables `ScaleBlockSize=16` end-to-end for the FP8/BF8 MX GEMM TDM
pipeline, building on the scale16 warp-gemm layer already in develop.

- **warp gemm:** add the 32x32x128 f8f6f4 scale16 traits and alias (2x2
grid of 16x16x128 scale16 intrinsic calls with per-subtile
`SCALE_OPSEL`), and route 32x32 f8f6f4 through the dispatcher's
`IsScale16` path.
- **default policy:** select the warp gemm via the dispatcher with
`IsScale16=(ScaleBlockSize==16)` so `WarpTile=16` and `WarpTile=32` each
pick the matching scale16 path; guard WarpTile M/N to 16 or 32;
scale-tile distribution for the scale16 layout.
- **pipeline V1/V2:** thread `Problem::ScaleBlockSize` through the
scale-window setup (replacing the hardcoded 32); expose `ScaleBlockSize`
for the kernel.
- **block gemm:** extract int64 (scale16) / int32 (scale32) scales by
width.
- **kernel:** scale16 descriptor order; reject unsupported
`BlockScaleSize`.
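
One plausible reading of the int64 / int32 split in the block gemm item is simple arithmetic over the number of scales per intrinsic call. The Python sketch below assumes 8-bit scales and K=128 per call (as in the 16x16x128 intrinsic); `packed_scale_bits` is a hypothetical helper and the reasoning is an assumption, not a statement from the PR.

```python
def packed_scale_bits(k_per_call: int, scale_block_size: int,
                      bits_per_scale: int = 8) -> int:
    """Bits needed to hold every scale that covers one intrinsic call."""
    if k_per_call % scale_block_size != 0:
        raise ValueError("K per call must be a multiple of the scale block size")
    return (k_per_call // scale_block_size) * bits_per_scale


scale16_bits = packed_scale_bits(128, 16)  # 8 scales of 8 bits each
scale32_bits = packed_scale_bits(128, 32)  # 4 scales of 8 bits each
```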

Test coverage for this path is in the stacked follow-up PR.
2026-06-17 16:41:00 +00:00
jakpiase
65bef78383 [rocm-libraries] ROCm/rocm-libraries#8518 (commit 1ad69c3)
[CK] Add support for large tensor index handling into conv
 bwd data (#8518)

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-17 15:51:36 +00:00
Illia Silin
b5713be6cd [rocm-libraries] ROCm/rocm-libraries#8501 (commit 54eb5dc)
[CK] disable DPP kernels by default

## Motivation

The dpp8 instruction has been disabled in the upstream llvm-project in
the latest compiler version, so we're hitting compilation errors with
the staging compiler:
<inline asm>:2:33: error: not a valid operand.
v_dot2c_f32_f16_dpp v6, v8, v7 dpp8:[0, 0, 0, 0, 0, 0, 0, 0]
                                ^
error: cannot compile inline asm
These instructions are used for fp16 gemms that are slightly faster than
dl gemms on gfx10, but are not critical.

Going to disable these kernels for now, until a better solution is
available, to unblock the builds with the staging compiler.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-17 14:03:00 +00:00
SamiAario-AMD
39182b50eb [rocm-libraries] ROCm/rocm-libraries#8487 (commit 06a73ba)
Skip tests on gfx11 that have intermittent failures

## Motivation

On gfx11, skip every load_and_convert_tile case where X and Y differ,
since these cases fail sporadically. Same-type tuples (half/half,
bf16/bf16, fp8/fp8) have been stable.

## Test Result

Stress-tested on gfx11, gfx12, and gfx950 with 10000 iterations of the
tests. No remaining test failures were detected.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-17 11:07:22 +00:00
damien-lejeune
5bebfd460f [rocm-libraries] ROCm/rocm-libraries#8492 (commit 46b6a06)
Add tile size for FMHA batch prefill bf16 for MI308X

## Motivation

Add a tile size tuned for MI308X for the FMHA batch prefill case with
BF16 input.

## Technical Details

N/A

## Test Plan

Benchmarking from the Aiter side with:

```
python3 op_tests/test_batch_prefill.py  -s 8000 -p 1 -q 4 -k 1 --head_dim 256 -c true -d bf16 --input_dtype bf16 --quant_method none --kv_layout linear -t sglang -l 0.0 --return_lse false --profile
```

## Test Result

We see an improvement with the new tile size on MI308X (both with PLT
mode OFF and ON)

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
2026-06-17 06:22:26 +00:00
damien-lejeune
2c0b7cbb0a [rocm-libraries] ROCm/rocm-libraries#8424 (commit debb669)
Add missing constraint in the FMHA qr async pipeline to enforce bk0=bk1 (#8424)

## Motivation

The purpose of this change is to add a guardrail on the values that bk0
and bk1 can take. This avoids ill-defined sizes that fail silently and
generate NaN (or other errors) at runtime.

An example of such failure can be obtained using the tile engine:

```
cd rocm-libraries/projects/composablekernel/tile_engine/ops/fmha
python fmha_benchmark.py configs/batch_prefill.json \
  --problems "1,4,1,8000,8000,256" \
  --filter "c.data_type=='bf16' and c.hdim_q==256 and c.pipeline=='qr_async' and c.mode=='group' and c.tile_n0==32 and c.tile_k0==64"
 ```

## Technical Details

The qr_async pipeline stages data along the K dimension into LDS using a bk1 descriptor, while gemm0 (Q*K^T) consumes bk0, so the two values must be equal.
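
A minimal sketch of the kind of guard described here, written in Python for illustration. `TileConfig` and `is_valid_qr_async` are hypothetical names, not the generator's actual API; the only rule taken from this PR is that `bk0` must equal `bk1` for the `qr_async` pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    # Hypothetical container; the field names are illustrative only.
    pipeline: str
    bk0: int  # K size consumed by gemm0 (Q*K^T)
    bk1: int  # K size of the descriptor used to stage data into LDS


def is_valid_qr_async(cfg: TileConfig) -> bool:
    """Reject qr_async configs whose staging and consuming K sizes differ."""
    if cfg.pipeline != "qr_async":
        return True  # the constraint only applies to qr_async
    return cfg.bk0 == cfg.bk1


configs = [
    TileConfig("qr_async", bk0=32, bk1=32),
    TileConfig("qr_async", bk0=64, bk1=32),  # mismatched: filtered out
    TileConfig("qr", bk0=64, bk1=32),        # other pipeline: untouched
]
kept = [c for c in configs if is_valid_qr_async(c)]
```

Filtering at generation time means the mismatched configuration never becomes an instance, instead of producing one that fails at runtime.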

## Test Plan

See command above

## Test Result

Before this change: invalid instances are generated and fail at runtime.
After this change: no such instance is generated.

## Submission Checklist

- [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
2026-06-16 07:41:58 +00:00
Brock Hargreaves
1b649a8d4b [rocm-libraries] ROCm/rocm-libraries#8332 (commit 48c389c)
[CK][CI] Retry builds on node failure with automatic rerouting (#8332)

## Motivation

When a Jenkins node enters a bad state (missing GPU driver, dead Docker
daemon, full disk), every PR scheduled onto it fails the same way until
a human manually takes it offline. Some failures are also transient and
would pass on a simple retry. Today the pipeline does neither — every
failure goes straight to red on the same node.

## Technical Details

Two new retry behaviors based on failure type:
- **Different node** for persistent node faults (driver missing, daemon
down, disk full, container won't start)
- **Retry in place** for transient glitches (registry pull, DNS), then a
different node if retries are exhausted

Real build/compile failures and aborted builds are never retried.

**New:** `src/org/ck/NodeFault.groovy`, `TransientFault.groovy` — typed
exceptions in the shared library `src/` for stable classloader identity
under dynamic library loading.

**`vars/ck.groovy`:** adds `preflight()` (host health checks before
build), `pullImage()` (classifying pull failures at the call site,
replacing `getDockerImage()`), `runOnHealthyNode()` (outer reroute loop,
up to 3 nodes), `runInPlace()` (same-node transient retries). GitHub
failure status is only set once all retries are exhausted.

**`Jenkinsfile`:** all active `Build CK and run Tests` stages converted
to `agent none` + `ck.runOnHealthyNode(…)`.
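
The two retry behaviors can be sketched as follows. This is a Python illustration of the control flow only, not the Groovy code in `vars/ck.groovy`; the function names mirror the ones above, but the signatures, the `max_retries=2` default, and the exception used when every node fails are assumptions.

```python
class NodeFault(Exception):
    """Persistent node problem: the build should move to a different node."""


class TransientFault(Exception):
    """Transient glitch: retry on the same node first."""


def run_in_place(step, node, max_retries=2):
    """Retry transient faults on the same node; escalate when exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return step(node)
        except TransientFault as exc:
            if attempt == max_retries:
                # Retries exhausted: escalate so the caller reroutes.
                raise NodeFault(f"transient retries exhausted on {node}") from exc


def run_on_healthy_node(step, nodes, max_nodes=3):
    """Try up to max_nodes nodes; only a NodeFault triggers a reroute."""
    last_fault = None
    for node in nodes[:max_nodes]:
        try:
            return run_in_place(step, node)
        except NodeFault as fault:
            last_fault = fault  # move on to the next node
    # Failure is reported once, after the whole budget is spent.
    raise RuntimeError("all candidate nodes failed") from last_fault
```

Any other exception, such as a real compile failure, is not caught by either function and propagates immediately, which matches the rule that build failures are never retried.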

## Test Plan

Tested on `users/brockhargreaves-amd/ck/node-failure-retry-logic` with
`USE_CURRENT_BRANCH_FOR_CK_GROOVY=true`. Verified preflight logging,
reroute on node fault, attempt counter in logs, no retry on aborts, and
single failure status report after budget exhausted.

## Test Result

Retry logic working as expected. Three bugs found and fixed during
testing: false `NodeFault` from host-level sccache probe (sccache is
in-container), `null` node name in catch logging, and `sh` calls outside
`node()` context in status reporting.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-15 17:40:10 +00:00
Andriy Roshchenko
b8440b3aeb [rocm-libraries] ROCm/rocm-libraries#8325 (commit 559eaf6)
[GFX1250][MX GEMM] Unified FLATMM GroupedGemm Implementation for MX Data Types (#8325)

## Motivation

Design and test a unified FLATMM GroupedGemm interface so that it
supports all MX FP8, FP6, and FP4 data types on both the gfx950 and
gfx1250 architectures and works seamlessly across these platforms.

## Technical Details

The implementation exposes a Grouped GEMM interface for the MX FLATMM
and MX TDM FLATMM pipelines.

## Test Plan

Add the following tests:
 - ck_tile/grouped_gemm_mx/test_grouped_gemm_mx_flatmm_non_tdm.cpp
 - ck_tile/grouped_gemm_mx/test_grouped_gemm_mx_flatmm_tdm.cpp
 - ck_tile/flatmm/test_mx_flatmm_persistent.cpp

Verify on the gfx950 and gfx1250 architectures.

## Test Result

All tests pass. Verified on A0 hardware with rocm-7.14.0a20260517

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-15 16:12:33 +00:00
Sami Remes
c1f7104852 [rocm-libraries] ROCm/rocm-libraries#6663 (commit f19fc01)
[CKTile] Fix MX GEMM: num_loop==3 dispatch, split-K, unsupported-shape guard (#6663)

Three independent MX GEMM correctness bugs reported against
example/ck_tile/42_mx_gemm (fp8xfp8, A=Row/B=Col) on MI350X, plus one
host-side atomic-add accumulation bug in the example's repeat loop.

- Pipeline (gemm_pipeline_ag_bg_cr_comp_async.hpp): BlockHasHotloop
required num_loop > PrefetchStages, which let num_loop == 3 enter a hot
loop that produced 5 gemm accumulations instead of 3 (K == 3*K_Tile,
e.g. K=768, deterministically wrong). Require num_loop >= 4 instead:
pre-pipeline + TailNumber::Three already totals exactly 3.

- Kernel (gemm_mx_kernel.hpp): split-K was silently broken because
GridSize did not thread k_batch into blockIdx.z and the scale tile
windows were anchored at K=0 for every k_id. Every k_id >= 1 therefore
read the wrong packed scales. Fix:
* GridSize returns dim3(grid_x, 1, k_batch) (persistent and
non-persistent).
* MakeScaleA/BBlockWindows accept a k_elem_offset and translate it to a
packed-scale K offset (also apply pad_tensor_view so OOB scale loads
return zero, matching A/B padding).
* operator() derives k_id from blockIdx.z, uses GetSplitKElemOffset
(matches Underlying::SplitKBatchOffset's K1-aligned formula), and
dispatches the epilogue with memory_operation_enum::atomic_add for
k_batch > 1, set for k_batch == 1. Same fp16/bf16 even-vector-size guard
as UniversalGemmKernel.
* MakeCBlockWindows templated on DstInMemOp; unconditionally applies
pad_tensor_view using kPadM/kPadN so partial trailing M/N tiles are
handled correctly.

- Compile- and runtime unsupported-shape guards (gemm_mx_kernel.hpp):
add IsSupportedArgument and a static_assert for configurations that
produce silent wrong results:
* static_assert(!kPadK) -- the MX comp-async pipeline uses
async_load_tile whose OOB check is per-vector-start, so a vector
straddling the K pad boundary reads garbage. Until the async path learns
per-element pad masking, reject kPadK at compile time.
* Runtime: k_batch >= 1; M/N multiples of MPerBlock/NPerBlock when
kPadM/kPadN are false; M >= MPerBlock and N >= NPerBlock always
(CShuffleEpilogue cannot safely run with a single partial tile); K %
(KPerBlock * k_batch) == 0; and for k_batch > 1, K must be a multiple of
WarpTile_K * k_batch so every split lands on a packed-scale boundary.
  * All error paths log under CK_TILE_LOGGING with actionable messages.

- Example (example/ck_tile/42_mx_gemm/mx_gemm_instance.hpp):
* Call Kernel::IsSupportedArgument up front and throw a clear
runtime_error for rejected shapes (was silently launching an unsupported
kernel).
* Switch to launch_kernel_time_mask with a clear_gemm_output preprocess
that zeroes C between iterations when k_batch > 1 (mirrors
universal_gemm_invoker). Without this the default -warmup=50 -repeat=100
accumulated 150 atomic_adds into C after the kernel-side split-K fix.
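
The runtime guards listed above can be restated as a small predicate. This is a Python sketch of the rules as written in this description, not the C++ `IsSupportedArgument`; the function signature, the reason strings, and the tile sizes in `TILE` are illustrative assumptions.

```python
def is_supported_argument(M, N, K, k_batch, *, m_per_block, n_per_block,
                          k_per_block, warp_tile_k, pad_m=False, pad_n=False):
    """Return (ok, reason) following the runtime rules listed in the text."""
    if k_batch < 1:
        return False, "k_batch must be >= 1"
    if M < m_per_block or N < n_per_block:
        return False, "M and N must each cover at least one full block tile"
    if not pad_m and M % m_per_block != 0:
        return False, "M must be a multiple of MPerBlock when kPadM is false"
    if not pad_n and N % n_per_block != 0:
        return False, "N must be a multiple of NPerBlock when kPadN is false"
    if K % (k_per_block * k_batch) != 0:
        return False, "K must be a multiple of KPerBlock * k_batch"
    if k_batch > 1 and K % (warp_tile_k * k_batch) != 0:
        return False, "K must be a multiple of WarpTile_K * k_batch for split-K"
    return True, "ok"


# Illustrative tile sizes only; the real values depend on the kernel config.
TILE = dict(m_per_block=128, n_per_block=128, k_per_block=256, warp_tile_k=128)

print(is_supported_argument(4096, 4096, 768, 1, **TILE))
print(is_supported_argument(4096, 4096, 128, 1, **TILE))
```

With these assumed tile sizes the first call is accepted and the second is rejected by the K check; the point of the guard is to reject such shapes before launch rather than return wrong results.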

Tests (test/ck_tile/gemm_mx/):
- Add MXfp8_GemmConfig16_PadMN (kPadM = kPadN = true).
- test_mx_gemm_fp8.cpp: HotLoopTailNumLoopThree (K=768 regression),
SplitK (k_batch=2,4 across full_k/partial_k paths),
TestMxGemmFp8PadMN::{MNPaddingAligned, MPadding, NPadding, MNPadding}
covering trailing partial tiles along M, N, or both.
- Run(...) now takes k_batch.
- packScalesMNxK: guard against OOB (mn, k) reads from src and
initialise e8m0 bytes to the zero exponent (0x00) instead of the
default-constructed NaN (0xFF), so padded lanes don't poison the packed
int32_t shared with in-range lanes.
- test_mx_gemm_instance.hpp: call IsSupportedArgument before launch.
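
To illustrate the `packScalesMNxK` padding change, here is a Python sketch of packing four e8m0 scale bytes into one 32-bit word. It is not the test helper itself: the function names and the byte order (lane i in byte i) are assumptions. The encodings 0xFF (NaN) and 0x00 come from the description above.

```python
E8M0_NAN = 0xFF       # e8m0 encoding for NaN
E8M0_ZERO_EXP = 0x00  # zero exponent encoding, used as padding


def pack_scales(scales, pad=E8M0_ZERO_EXP):
    """Pack up to four e8m0 bytes into one 32-bit word, padding missing lanes."""
    if len(scales) > 4:
        raise ValueError("at most four scales fit in one 32-bit word")
    lanes = list(scales) + [pad] * (4 - len(scales))
    word = 0
    for i, byte in enumerate(lanes):
        word |= (byte & 0xFF) << (8 * i)  # lane i in byte i (assumed order)
    return word


def has_nan_lane(word):
    """True if any of the four packed bytes is the NaN encoding."""
    return any(((word >> (8 * i)) & 0xFF) == E8M0_NAN for i in range(4))


in_range = [0x7F, 0x80]  # two valid scales; the other two lanes are padding
print(hex(pack_scales(in_range)))                       # padded with 0x00
print(has_nan_lane(pack_scales(in_range)))              # padding is harmless
print(has_nan_lane(pack_scales(in_range, pad=E8M0_NAN)))  # NaN padding shows up
```

Because the padded and in-range lanes share one word, a NaN byte in a padded lane travels with the valid scales, which is why the padding value matters.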

Verification on gfx950, ROCm 7.2.0:
- ctest -R test_ck_tile_mx_gemm -> 100% (2/2).
- Example sweep over the original bug-report shapes: all K-aligned
shapes now validate correct (including 4096^3 sk=2 and the K=768 cases);
all K=128 shapes cleanly rejected with the new error message instead of
producing silent wrong results.

Made-with: Cursor

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-15 08:28:55 +00:00
damien-lejeune
aab1d219f5 [rocm-libraries] ROCm/rocm-libraries#8350 (commit f92ded1)
Add tile shape for FMHA batch prefill on MI308X (on fp8, hdim=256) (#8350)

## Motivation

Add a tile size appropriate for FMHA batch prefill fp8/hdim256 on MI308X

## Technical Details

Appending the tile shape to the existing factory such that it can be
picked up by Aiter

## Test Plan

Ran the performance test on both MI300X and MI308X

## Test Result

MI300X performance seems unaffected by this change. MI308X does improve.

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
2026-06-15 07:00:35 +00:00
SamiAario-AMD
947dcc2606 [rocm-libraries] ROCm/rocm-libraries#5510 (commit 8415c8c)
[CK Tile] Add transposed tile load implementation, and tests for load_and_convert_tile (#5510)

## Motivation

Mixed precision b/fp16 x fp8 requires a transposed tile load
implementation that supports mixed precision using these types.
Implement this, use it in `load_and_convert_tile`, and add a unit test
for `load_and_convert_tile` which covers this functionality.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-15 06:42:28 +00:00
ltqin
0954a8f3fa [rocm-libraries] ROCm/rocm-libraries#8262 (commit d4ff8fc)
[CK_TILE] Add graph capture support for FMHA backward(new branch) (#8262)

## Motivation

Add HIP graph capture support for FMHA backward operations. The original
implementation only supported normal execution mode and would cause
use-after-free crashes when used with graph capture replay.

When FMHA backward is captured into a HIP graph:

- First replay: host callback executes and deletes the closure (as
  designed for normal mode)
- Subsequent replays: use-after-free crash because the closure was
  already freed

This PR enables `fmha_bwd_launcher::prepare_workspace_async()` to work
correctly in both normal execution and graph capture modes.
2026-06-14 03:11:53 +00:00
Johannes Graner
01cca38c8e [rocm-libraries] ROCm/rocm-libraries#8220 (commit 4c04a3a)
[CK Tile] WAVELET pipeline for backward-data grouped convolution (#8220)

## Motivation

On the RetinaNet shapes (gfx950, fp16) CK Tile backward-data conv was
~18% behind classic CK, with the gap concentrated in the K=2376 3x3
detection-head family where bwd_data spends most of its time. The
WAVELET GEMM pipeline already gives uplift for forward and
backward-weight conv; this ports it to backward-data and consolidates
the now-shared machinery across all three directions.

## Technical Details

- Backward-data wavelet support in the tile kernel: launch extra load
  waves when the pipeline exposes `LaunchBlockSize`, and split the
  epilogue into math waves (run the CShuffle epilogue) and load waves
  (`RunBarrierStub`).
- Register 7 WAVELET instances (fp16 and bf16), tuned for
  backward-data's tall-skinny GEMM rather than the forward tile shapes:
  a big-M `256/128/64` workhorse, a `VecA=4` variant for the
  `K % 8 != 0` shapes, and a `NumGroupsToMerge=32` variant for grouped
  (depthwise-style) shapes.
- Implement the native backward-data instance parser in
  `generate_instances.py`.
- Deduplicate the wavelet machinery shared by forward, backward-data,
  and backward-weight: `GroupedConvLaunchBlockSize`,
  `is_wavelet_pipeline`, and `RunWaveletAwareEpilogue` in
  `grouped_convolution_utils.hpp`; the three native instance parsers
  collapse to one parameterized parser. The three kernels now call the
  shared helpers.

## Test Plan

- Rebuild the full profiler instance pools for all three directions
  (fp16/bf16/fp32, nhwgc/ndhwgc) to exercise the shared helpers across
  every instantiation.
- Tile GTests on gfx950: `test_grouped_convnd_fwd_tile`,
  `test_grouped_convnd_bwd_data_tile`,
  `test_grouped_convnd_bwd_weight_tile`.
- Per-shape sweep of the 35 RetinaNet backward-data shapes vs classic CK
  and the non-wavelet tile pool (`profile_wavelet_bwd_data.py`);
  correctness spot-checked with GPU-reference verification on the new
  big-M and NumGroupsToMerge instances.

## Test Result

- GTests pass: forward 9/9, backward-data 6/6, backward-weight 6/6.
- Backward-data perf (3x3 g=1 region, geomean classic/tile): 0.88 ->
  1.11, i.e. the tile path goes from ~12% slower than classic to ~8%
  faster. The largest single backward-data shape (256x100x100->2376)
  moves from 11% slower than classic to 12.5% faster.
- The dedup refactor preserves behavior (net -174 lines across the
  kernels/generator), confirmed by the full rebuild and the GTests
  above.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-13 00:10:50 +00:00
John Afaganis
329e589840 [rocm-libraries] ROCm/rocm-libraries#8260 (commit 1139236)
[ck] Enforce LF-only line endings in C/C++ sources

## Summary

Several CK source files carry Windows **CRLF** line endings (a trailing
carriage return on each line), introduced by editors configured for
Windows endings or copy/paste from Windows tooling. These are purely
cosmetic but they pollute diffs (whole-file churn the first time someone
makes an LF edit), confuse `clang-format`, and are inconsistent with the
LF-only convention used across the rest of the tree.

This PR (a) normalizes every existing CRLF file (6 files) to LF and (b)
adds a pre-checkin gate so new CRLF leaks are rejected before merge.

## File extensions covered

Both the cleanup scan and the new Jenkins enforcement stage use the same
predicate as the adjacent `ASCII Only Check` stage:

```
*.h  *.hpp  *.cpp  *.h.in  *.hpp.in  *.cpp.in  *.inc  *.cl
```

(excluding `*/build/*` and `*/include/rapidjson/*`). The local
pre-commit hook's `c++/inc` type filter covers the same set.

## Why no enforcement today

CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so
the existing `pre-commit` workflow doesn't touch CK. The local CK
`.pre-commit-config.yaml` only runs for developers who installed hooks.
The **authoritative gate is therefore the new Jenkins stage** in this
PR; the local hook is convenience.

## Commit layout (bisect-friendly)

1. `[ck] Normalize CRLF line endings to LF in C/C++ sources`
Mechanical line-ending cleanup across 6 files. No content change: every
edit is purely CRLF -> LF, verified with `git diff --ignore-cr-at-eol`
reporting an empty diff.

2. `[ck] Enforce LF-only line endings in C/C++ sources`
- New `projects/composablekernel/script/check_no_crlf.sh` (modeled on
`check_ascii_only.sh`).
- New `crlf-checker` entry in
`projects/composablekernel/.pre-commit-config.yaml` under the
local-hooks block (`types_or: [c++, inc]`).
- New `CRLF Check` parallel stage in
`projects/composablekernel/Jenkinsfile`'s `Static checks` block,
mirroring the adjacent `ASCII Only Check` stage. Always-on, no
`RUN_CPPCHECK` gate.

The tree is buildable at every commit boundary. Commit 1 leaves 0 CRLF
violations; commit 2 wires the gate.

## Demo

Script output on a synthesized violation:

```
$ printf 'int main() {}\r\n' > /tmp/bad.cpp
$ projects/composablekernel/script/check_no_crlf.sh /tmp/bad.cpp
ERROR: /tmp/bad.cpp contains CRLF (Windows) line endings:
1:int main() {}<CR>
  Fix: convert to LF, e.g. 'sed -i 's/\r$//' /tmp/bad.cpp' or 'dos2unix /tmp/bad.cpp'
$ echo $?
1
```
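
The same check can be approximated in a few lines of Python. This is a sketch of the idea only, not the contents of `check_no_crlf.sh`; the function name `find_crlf_lines` is hypothetical.

```python
def find_crlf_lines(data: bytes):
    """Return the 1-based numbers of lines that end in CRLF."""
    pieces = data.split(b"\n")
    hits = []
    # The last piece is whatever follows the final "\n" (often empty), so it
    # has no line ending of its own and is skipped.
    for lineno, line in enumerate(pieces[:-1], start=1):
        if line.endswith(b"\r"):
            hits.append(lineno)
    return hits


print(find_crlf_lines(b"int main() {}\r\n"))  # the synthesized violation
print(find_crlf_lines(b"int main() {}\n"))    # a clean LF-only file
```

Reading the file as bytes matters here: opening it in text mode with universal newlines would translate CRLF to LF before the check could see it.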

Full repo scan after the cleanup commit:

```
$ cd projects/composablekernel && find . -type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \
    -o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \
    -not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \
  | xargs -0 -P 8 -n 64 script/check_no_crlf.sh
$ echo $?
0
```

## Test plan

- [ ] Jenkins PR build: confirm new `Static checks -> CRLF Check` stage
runs green over the full predicate and the existing `ASCII Only Check` /
`Clang Format` stages are unaffected.
- [ ] Local: `pre-commit run crlf-checker --all-files` runs cleanly
after installing CK pre-commit hooks.
- [ ] Manually inject a CRLF line ending in any `.cpp/.hpp/.inc` file,
push: confirm Jenkins fails the new stage with a clear error.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-06-12 21:11:59 +00:00
Brock Hargreaves
96a7e44832 [rocm-libraries] ROCm/rocm-libraries#8378 (commit d68585d)
[CK] Pre-emptively add groovy/ folder and skip TheRock CI filter (#8378)

## Motivation

The CK Groovy library is growing and will be reorganized into a
self-describing `groovy/` folder rather than living under `src/` and
`vars/`. This PR creates that folder pre-emptively and adds it to the
TheRock CI skip-list so that future Groovy additions do not
unnecessarily trigger TheRock builds.

## Technical Details

- Added `projects/composablekernel/groovy/` with a `.gitkeep` to
establish the directory in the repo.
- Added `"projects/composablekernel/groovy/*"` to
`SKIPPABLE_PATH_PATTERNS` in `.github/scripts/therock_configure_ci.py`
alongside the existing `vars/*` entry, ensuring changes confined to
Groovy pipeline code are recognized as non-therock-relevant and skip the
TheRock CI pipeline.
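
As a rough illustration of how a glob-based skip list behaves, here is a Python sketch using the standard `fnmatch` module. Only the `groovy/*` pattern is taken from this PR; the helper functions are hypothetical, and whether `therock_configure_ci.py` matches paths exactly this way is an assumption.

```python
from fnmatch import fnmatch

# Only the pattern added by this PR; the real list has more entries.
SKIPPABLE_PATH_PATTERNS = [
    "projects/composablekernel/groovy/*",
]


def is_skippable(path: str) -> bool:
    """True if the path matches at least one skip pattern."""
    return any(fnmatch(path, pattern) for pattern in SKIPPABLE_PATH_PATTERNS)


def should_run_ci(changed_paths) -> bool:
    """Run CI unless every changed path is skippable."""
    return not all(is_skippable(p) for p in changed_paths)


print(should_run_ci(["projects/composablekernel/groovy/.gitkeep"]))
print(should_run_ci([
    "projects/composablekernel/groovy/.gitkeep",
    "projects/composablekernel/include/ck/ck.hpp",
]))
```

Note that `fnmatch` lets `*` match path separators as well, so in this sketch the pattern also covers files in nested subdirectories of `groovy/`.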

## Test Plan

No code logic was changed. Verified that `therock_configure_ci.py`
pattern list is consistent with the existing `vars/*` skip entry and
that the new pattern follows the same glob convention.

## Test Result

N/A — directory scaffolding and CI filter only; no functional code
affected.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-12 20:11:53 +00:00
Illia Silin
d450749933 [rocm-libraries] ROCm/rocm-libraries#8357 (commit 800965c)
[CK] Re-enable HIPRTC codegen tests for all CK PRs.

## Motivation

When we introduced the smart test filter to build and run only the
tests affected by a PR's changes, we disabled the client examples, which
require a full CK build, and with them the hiprtc tests that were
grouped with the client examples. As a result, a few PRs that broke
hiprtc compilation slipped through.
Restoring the hiprtc tests for all PRs should close this gap.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-12 19:19:44 +00:00
Illia Silin
789ef38093 [rocm-libraries] ROCm/rocm-libraries#8333 (commit 69b3fc1)
Revert "[CK_TILE] Implement RTC API for a subset of FMHA functionality for MGX" (#8333)

Reverts ROCm/rocm-libraries#6086
Need to revert as the codegen test for fmha is failing due to including
std header:

[2026-06-11T22:36:03.673Z] In file included from
/tmp/comgr-953928-0-473822/include/ck/host/device_fmha_fwd/fmha_fwd_wrapper.hpp:8:
[2026-06-11T22:36:03.673Z] In file included from
/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/cmath:49:
[2026-06-11T22:36:03.673Z] In file included from
/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/std_abs.h:38:
[2026-06-11T22:36:03.673Z] /usr/include/stdlib.h:32:10: fatal error:
'stddef.h' file not found
[2026-06-11T22:36:03.673Z]    32 | #include <stddef.h>
[2026-06-11T22:36:03.673Z]       |          ^~~~~~~~~~

The ck_tile headers were never prepped for hiprtc compilation.
2026-06-12 18:19:31 +00:00
Wojciech Laskowski
c2601f38b7 [rocm-libraries] ROCm/rocm-libraries#6569 (commit 393049e)
Adding amdgcn_mma specializations for sparse MFMA builtins (#6569)

## Motivation

This PR is part of the [WMMA/MFMA] unification work. It's the fourth of
the series of PRs (after
https://github.com/ROCm/rocm-libraries/pull/5801,
https://github.com/ROCm/rocm-libraries/pull/6014 and
https://github.com/ROCm/rocm-libraries/pull/6567) that add all the
necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse
MFMA intrinsics.

## Technical Details

This change adds new specializations for MFMA sparse builtins. In total,
we add 27 MFMA builtins.

## Test Plan

All the new wrappers were added to the test suite in
`test_amdgcn_mma_layout.inc`.

## Test Result

Tests pass locally; waiting for CI.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-12 12:48:29 +00:00
Enrico Degregori
e75076c826 [rocm-libraries] ROCm/rocm-libraries#8310 (commit 003bc6b)
[CK Tile] Fix assert usage MX GEMM

## Motivation

See issue https://github.com/ROCm/rocm-libraries/issues/8223

## Technical Details

 - Use `std::runtime_error` in `mx_processing.hpp`
 - Use `static_assert` in `tensor_shuffle_utils.hpp`

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-12 11:42:38 +00:00
Thrupti Raj Lakshmana Gowda
d7609923b6 [rocm-libraries] ROCm/rocm-libraries#7919 (commit 061001d)
Users/tlakshma/ck/tile engine develop

## Motivation

This PR adds multiple new GPU kernel benchmarking operations to the CK
Tile Engine, expanding its coverage of GEMM-family operations:

- **gemm_multi_abd**: GEMM with multiple A, B, and D tensors, enabling
epilogue patterns such as scale/bias fusion.
- **batched_contraction**: Batched tensor contraction supporting
multi-dimensional batch (G), M, N, and K dimensions, targeting workloads
where the contraction indices span more than one logical axis.
- **mx_gemm**: MX-format GEMM with microscaling (e8m0) scale tensors.
- **gemm_rowcolquant**: Block-scale GEMM with row/column quantization.
- **gemm_tensor_quant**: Block-scale GEMM with tensor quantization.
- **grouped_gemm_rowcolquant**: Grouped GEMM with row/column
quantization.
- **grouped_gemm_tensorquant**: Grouped GEMM with tensor quantization.
- **batched_gemm**: Batched GEMM benchmarking support.

## Technical Details

### gemm_multi_abd

  - New subdirectory: tile_engine/ops/gemm/gemm_multi_abd/
  - CMakeLists.txt follows the same individual-target pattern as
    gemm_universal / gemm_multi_d.
  - gemm_multi_abd_instance_builder.py subclasses GemmKernelBuilder from
    the shared gemm_instance_builder.py.
  - gemm_multi_abd_benchmark.py delegates to the shared GemmBenchmark
    parent class.
  - Configs: default_config.json, default_ci_config.json,
    user_provided_config.json.
  - Supported GPU targets: gfx90a, gfx942, gfx950, gfx1201.

### batched_contraction

  - New subdirectory: tile_engine/ops/gemm/batched_contraction/
  - Extends GemmKernelBuilder via BatchedContractionKernelBuilder,
    adding num_dim_g, num_dim_m, num_dim_n, num_dim_k, num_d_tensors,
    and elementwise_function parameters.
  - Layout string uses 3-character encoding (A+B+E), e.g. rcr.
  - Self-contained benchmark sweep driver
    (batched_contraction_benchmark.py) with JSON/CSV export and
    best-kernel selection.
  - Supported GPU targets: gfx90a, gfx942, gfx950.
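
The 3-character layout string can be decoded with a small helper. This Python sketch assumes the usual CK convention that `r` means row-major and `c` means column-major; `parse_layout` is a hypothetical function, not part of the instance builder.

```python
# Assumed meaning of each layout character.
_LAYOUTS = {"r": "row-major", "c": "column-major"}


def parse_layout(code: str) -> dict:
    """Split a layout string such as 'rcr' into layouts for A, B and E."""
    if len(code) != 3:
        raise ValueError(f"expected 3 characters (A+B+E), got {code!r}")
    try:
        a, b, e = (_LAYOUTS[ch] for ch in code.lower())
    except KeyError as exc:
        raise ValueError(f"unknown layout character in {code!r}") from exc
    return {"A": a, "B": b, "E": e}


print(parse_layout("rcr"))
```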

### mx_gemm

  - New subdirectory: tile_engine/ops/gemm/mx_gemm/
  - Supports MX-format (e8m0) microscaling for A and B scale tensors.

### block_scale_gemm (gemm_rowcolquant, gemm_tensor_quant)

  - New subdirectory: tile_engine/ops/gemm/block_scale_gemm/
  - gemm_rowcolquant: row/column quantization epilogue.
  - gemm_tensor_quant: tensor-level quantization epilogue.

### grouped_gemm_quant (grouped_gemm_rowcolquant, grouped_gemm_tensorquant)

  - New subdirectory: tile_engine/ops/gemm/grouped_gemm_quant/
  - grouped_gemm_rowcolquant: grouped GEMM with row/column quantization.
  - grouped_gemm_tensorquant: grouped GEMM with tensor quantization.

### batched_gemm

  - New subdirectory: tile_engine/ops/gemm/batched_gemm/
  - Batched GEMM benchmark support wired into the sampling/active-op
    lists.

All new ops are registered in op_weights.json for budget allocation and
wired into the active-op sampling lists in CMakeLists.txt.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-11 20:38:38 +00:00
jefyang1
276863ca87 [rocm-libraries] ROCm/rocm-libraries#8259 (commit df03f10)
Add cluster launch in test ck_tile mx gemm tdm wmma

## Motivation

Add cluster launch test in test_ck_tile_mx_gemm_pipeline_tdm_wmma on
gfx1250, so that we can check the performance on gfx1250 hardware.

## Technical Details

Add an out-of-bounds guard in RunGemm of MxGemmKernel to skip blocks
padded by cluster alignment.

Add ClusterEnable/ClusterDisable aliases and extend the tuple in
test_mx_gemm_pipeline_kernel_types.hpp with two kernel types that use
ClusterEnable, for F8 CompTDMV1 and CompTDMV2 respectively. The existing
F4 non-ClusterLaunch kernel types have an issue that still needs to be
fixed, so this PR does not include F4 cases.

Read ClusterLaunch from the tuple in test_mx_gemm_pipeline_util.hpp.

Update invoke_mx_gemm to branch on ClusterLaunch: add cluster size
constants and switch the GemmShape type, the TilePartitioner type, and
the kernel launch call.

## Test Plan

Tested the changes on gfx1250 FFM.

## Test Result

The added kernel types (instances) passed the tests on gfx1250 FFM.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-11 17:33:11 +00:00
music-dino
359f664b25 [rocm-libraries] ROCm/rocm-libraries#6086 (commit d25d8cc)
[CK_TILE] Implement RTC API for a subset of FMHA functionality for MGX (#6086)

## Motivation

Introduce a wrapper for the FmhaFwdKernel, for use in runtime
compilation (RTC) in MIGraphX.

## Technical Details

The intent of the API is to provide multiple instances of the
FmhaFwdKernelWrapper, suitable for a particular problem definition.
At the moment the wrapper only supports bias and causal masking; feature
expansion will come in a future PR.
The usage pattern is, in short:

1. Define fmha_fwd::Problem (input dimensions, data type, etc.).
2. Fetch Solutions for the target architecture (currently only gfx942)
   based on the Problem.
   The solutions contain a map of template -> template parameter and can
   be converted to a string representing the full instantiation of
   FmhaFwdKernelWrapper, e.g. `ck_tile::FmhaFwdWrapper<ck_tile::fp16_t, 128,
   64, 16, 32, 32, 32, 4, 1, 1, 4, 1, 1, 32, 32, 16, 32, 32, 16, false,
   true, false, true, true, true, true, ck_tile::FmhaPipelineTag::QR>`
3. The instance can then be used in an RTC kernel. The kernel needs to:
   * Construct a Descriptor (containing descriptions of all input
     tensors).
   * Call IsValid() on the descriptor to check if the instance is
     applicable. Note that this is constexpr by design so that it can
     fail the kernel compilation as a signal that the kernel is not
     applicable.
   * Pass the descriptor and input pointers to the wrapper Run method.

A more detailed example of usage can be found in
codegen/test/fmh_fwd.cpp
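
The step of turning a solution's template parameters into an instantiation string can be pictured with a short Python sketch. `render_instance` is a hypothetical helper, not the codegen API, and the shortened parameter list below is for illustration only and is not a valid instantiation.

```python
def render_instance(template_name, params):
    """Render a C++ template instantiation string from ordered parameters."""
    def fmt(value):
        # bool is checked first because Python treats it as a kind of int;
        # C++ spells the literals in lowercase.
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)

    return f"{template_name}<{', '.join(fmt(v) for v in params)}>"


# Shortened, illustrative parameter list.
print(render_instance("ck_tile::FmhaFwdWrapper",
                      ["ck_tile::fp16_t", 128, 64, True, False]))
```

A string in this form is what gets embedded in the RTC kernel source, so the order of the parameters has to match the template's declaration.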

Beside work on creating the wrapper and the supporting API, the PR also
contains some changes necessary to enable compilation with HIPRTC.
The contents of the CK tile headers are embedded in a binary file which
is used to pass the header files as strings to HIPRTC.
Many of the ck tile headers contain host only code which leads to
compilation failures.
ck_tile_headers_preprocessor goes through the embedded headers and
removes the bodies of host only functions, thereby eliminating the
compilation failures.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-11 16:22:37 +00:00
Bartłomiej Kocot
0fdbf8a91d [rocm-libraries] ROCm/rocm-libraries#8272 (commit 1c66ecb)
[CK] Padding on K for global load for grouped conv bwd data (#8272)

## Motivation

Fix incorrect results caused by missing padding during global load in
the grouped convolution backward data kernel. The padding is needed
because the global load performs no OOB check.

## Technical Details

Add the padding required by the global load, which does not use an OOB
check.

## Test Plan

test_grouped_convnd_bwd_data*

## Test Result

Passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-11 15:28:21 +00:00
BrianHarrisonAMD
f0545b5c15 [rocm-libraries] ROCm/rocm-libraries#8132 (commit 57d21a1)
[CK dispatcher] - LGBM predict data_type FLOAT32->FLOAT64 in ml_heuristic (#8132)

## Summary
`ml_heuristic.hpp` calls `LGBM_BoosterPredictForMat(...,
/*data_type=*/0, ...)` (`C_API_DTYPE_FLOAT32`) against a
`std::array<double, NUM_FEATURES>` feature buffer. LightGBM reinterprets
the 8-byte doubles as 4-byte floats → invalid predictions → the
heuristic's argmax always tie-breaks to the first/smallest enumerated
config.

**Fix:** `data_type 0 → 1` (`C_API_DTYPE_FLOAT64`), matching the
`double` buffer. After the fix, predictions vary and track real TFLOPS
(the model correctly prefers larger tiles).

## Verification
- The feature buffer `f` is `std::array<double, NUM_FEATURES>`
(NUM_FEATURES = 72) → `f.data()` is a `double*`.
- The changed `0` is the 3rd positional `data_type` argument (not
`nrow`/`ncol`/`is_row_major`).

One-line functional change.
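
The mismatch described above can be reproduced without LightGBM. The following is a minimal sketch, assuming a consumer that trusts the dtype code to decide the element width; `read_features` and the two constants are hypothetical stand-ins that only mirror the `0`/`1` codes quoted in the summary, not LightGBM's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical constants mirroring the dtype codes quoted above.
constexpr int kDtypeFloat32 = 0; // C_API_DTYPE_FLOAT32
constexpr int kDtypeFloat64 = 1; // C_API_DTYPE_FLOAT64

// Hypothetical consumer: reads `count` values from an untyped buffer,
// using only the dtype code to decide how wide each element is.
std::vector<double> read_features(const void* data, int data_type, std::size_t count)
{
    assert(data != nullptr);
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::vector<double> out(count);
    for(std::size_t i = 0; i < count; ++i)
    {
        if(data_type == kDtypeFloat32)
        {
            float v;
            std::memcpy(&v, bytes + i * sizeof(float), sizeof(float));
            out[i] = v; // 4-byte view of the buffer
        }
        else
        {
            double v;
            std::memcpy(&v, bytes + i * sizeof(double), sizeof(double));
            out[i] = v; // 8-byte view of the buffer
        }
    }
    return out;
}
```

Calling `read_features` on a `double` buffer with `kDtypeFloat64` returns the original values, while `kDtypeFloat32` slices each 8-byte double into two unrelated 4-byte floats, which is the kind of garbage input the heuristic was receiving.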

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 18:57:31 +00:00
Illia Silin
a433424e08 [rocm-libraries] ROCm/rocm-libraries#8241 (commit cd183df)
[CK] increase time limit for fmha_bwd tests to prevent
 timeouts (#8241)

## Motivation

Observed a CI failure caused by an fmha_bwd test timeout, which had never
happened before. This change increases the time limit for the test to
prevent further CI failures.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-10 15:37:44 +00:00
Ville Pietilä
c6c55db757 [rocm-libraries] ROCm/rocm-libraries#8019 (commit 6472935)
[CK TILE] Fix performance regression caused by Dispatcher
 codegen compiler flag. (#8019)

## Motivation

Currently CK Tile has two codegen paths: CK Builder and CK Tile Dispatcher.
The CK Tile Dispatcher codegen uses an additional compiler flag that is
not present in the CK Builder codegen workflow. The additional compiler
flag can cause a performance regression for some instances, as it disables
relevant compiler optimizations.

## Technical Details

Removed the compiler flag `-mllvm -enable-noalias-to-md-conversion=0` from
the CMakeLists.txt that creates the instance library from Dispatcher
codegen.

## Test Plan

Required testing is contained in the CI/CD pipeline.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-10 09:12:03 +00:00
Chao
320a813d67 [rocm-libraries] ROCm/rocm-libraries#6533 (commit 5dcaa45)
[CK_TILE] Add host-side Pack-GQA optimization for FMHA
 forward (#6533)

## Motivation

Host-side Pack-GQA optimization for CK-Tile FMHA forward. Reshapes Q tensor
from `[b, nhead_q, seqlen_q, d]` to `[b, nhead_kv, nhead_ratio * seqlen_q, d]`
by adjusting strides, so grouped Q-heads sharing the same KV data are
processed in a single tile. Zero kernel changes; runner-only.
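
The stride adjustment can be illustrated on the host with plain index arithmetic. This is a sketch under stated assumptions, not the runner's code: it assumes a contiguous `[b, nhead_q, seqlen_q, d]` layout in which the Q heads of one KV group are adjacent in memory, and `Strides4`, `contiguous_q_strides`, `packed_q_strides` and `element_offset` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Element strides of a 4-D view indexed as [batch, head, row, dim].
struct Strides4
{
    std::int64_t batch, head, row, dim;
};

// Strides of a contiguous Q tensor [b, nhead_q, seqlen_q, d] (assumed layout).
Strides4 contiguous_q_strides(std::int64_t nhead_q, std::int64_t seqlen_q, std::int64_t d)
{
    return {nhead_q * seqlen_q * d, seqlen_q * d, d, 1};
}

// Strides of the packed view [b, nhead_kv, nhead_ratio * seqlen_q, d] over the
// same memory. Only the head stride changes: one packed head spans nhead_ratio
// original heads, and its rows simply continue from one original head into the next.
Strides4 packed_q_strides(const Strides4& q,
                          std::int64_t nhead_q,
                          std::int64_t nhead_kv,
                          std::int64_t seqlen_q)
{
    assert(nhead_kv > 0 && nhead_q % nhead_kv == 0);
    assert(q.head == seqlen_q * q.row); // heads must be back-to-back in memory
    const std::int64_t nhead_ratio = nhead_q / nhead_kv;
    return {q.batch, q.head * nhead_ratio, q.row, q.dim};
}

// Linear element offset of index (b, h, r, c) under the given strides.
std::int64_t element_offset(const Strides4& s,
                            std::int64_t b,
                            std::int64_t h,
                            std::int64_t r,
                            std::int64_t c)
{
    return b * s.batch + h * s.head + r * s.row + c * s.dim;
}
```

Under these assumptions, packed row `r` of KV head `g` addresses the same element as row `r % seqlen_q` of Q head `g * nhead_ratio + r / seqlen_q`, so no data is copied; only the view changes.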

Phase 1: non-causal attention with GQA ratio packing.
Phase 2: extends to dropout and split-kv paths, fixes stride edge cases.

## Technical Details

Modified files (2):
- `example/ck_tile/01_fmha/example_fmha_fwd.cpp`: Pack-GQA flag plumbing
- `example/ck_tile/01_fmha/fmha_fwd_runner.hpp`: Q tensor reshape logic,
  stride adjustment for GQA ratio packing

New files (1):
- `example/ck_tile/01_fmha/test_pack_gqa_phase2.sh`: 53 test cases covering
  non-causal, dropout, split-kv, various GQA ratios

## Dependencies

None — this PR is standalone.

## Test Plan

- GPU validation on MI300X (gfx942, ROCm 6.4.1):
  - Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=32 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3`
- GPU validation on MI350X (gfx950, ROCm 7.0), 53 parameterized test cases:
  - Command (GQA 4:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=32 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3`
  - Command (GQA 8:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=64 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3`
  - Command (decode): `./build/bin/tile_example_fmha_fwd -b=64 -h=32 -h_k=8 -s=1 -s_k=4096 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3`

## Test Result

Benchmark results (MI350X, gfx950, ROCm 7.0):

| Config | Without Pack | With Pack | Improvement |
|--------|-------------|-----------|-------------|
| GQA 4:1 prefill b=2 h=32 hk=8 s=2048 d=128 bf16 | 690.05 TFlops (0.199 ms) | 695.61 TFlops (0.198 ms) | +0.8% |
| GQA 8:1 prefill b=2 h=64 hk=8 s=2048 d=128 bf16 | 706.25 TFlops (0.389 ms) | 729.35 TFlops (0.377 ms) | +3.3% |
| GQA 8:1 decode b=64 h=32 hk=4 s_k=4096 d=128 bf16 | 305.20 GB/s (1.763 ms) | 1813.41 GB/s (0.297 ms) | **+5.9x** |
| LLaMA-70B decode b=32 h=64 hk=8 s_k=4096 d=128 bf16 | 591.70 GB/s (0.909 ms) | 1820.65 GB/s (0.295 ms) | **+3.1x** |
| MHA ratio=1 b=2 h=8 s=4096 d=128 bf16 | 695.16 TFlops | 702.72 TFlops | no regression |

Benchmark results (MI300X, gfx942, ROCm 6.4.1):

No regression on MI300X. Pack-GQA is a runner-only optimization (zero
kernel changes); the performance impact on MI300X is within noise.

| Config | TFlops / GB/s | Time (ms) | Delta vs baseline |
|--------|-------------|-----------|-------------------|
| MHA bf16 b=2 h=8 s=4096 d=128 | 336.52 TFlops | 0.408 | -1.7% |
| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 | 322.52 TFlops | 0.426 | -0.7% |
| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 | 349.85 TFlops | 0.786 | +0.5% |
| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 | 381.29 TFlops | 1.442 | +1.2% |
| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 | 697.32 GB/s | 1.541 | +0.8% |

All validation tests pass (`valid:y`) on both MI300X and MI350X.

Additional validation:
- 53 parameterized test cases pass (23 phase 1 + 30 phase 2)
- GQA ratios tested: 1:1, 2:1, 4:1, 8:1, 32:1
- No regression on MHA (ratio=1) workloads
- fp16 and bf16 validated
2026-06-10 01:56:44 +00:00
Bartłomiej Kocot
928b46c3bd [rocm-libraries] ROCm/rocm-libraries#8208 (commit 7240d71)
[CK] Fix scale init in profile_grouped_conv_fwd_outelementop
 (#8208)

## Motivation

Wrong scale initialization caused random errors on CI.

## Technical Details

InvScale was initialized to 0, which caused NaNs during division. Zero is
now excluded from the random initialization range.
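
One way to keep zero out of a randomly initialized divisor is rejection sampling. The sketch below is an assumption about the general approach, not the profiler's actual initializer; `random_nonzero` is a hypothetical helper.

```cpp
#include <cassert>
#include <random>

// Draws an integer uniformly from [lo, hi] but never returns 0,
// so the result is always safe to use as a divisor.
int random_nonzero(std::mt19937& gen, int lo, int hi)
{
    assert(lo <= hi);
    assert(!(lo == 0 && hi == 0)); // the range must contain a non-zero value
    std::uniform_int_distribution<int> dist(lo, hi);
    int v = 0;
    do
    {
        v = dist(gen); // retry whenever the draw lands on zero
    } while(v == 0);
    return v;
}
```

A scale drawn this way can be inverted or divided by without producing infinities or NaNs from a zero denominator.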

## Test Plan

TestGroupedConvndFwdConvInvscale3d

## Test Result

Passed in 100 runs

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-1400
2026-06-09 21:42:53 +00:00
Bartłomiej Kocot
cb099eb963 [rocm-libraries] ROCm/rocm-libraries#8155 (commit c25787b)
[CK] Magic division for long_index_t

## Motivation

Improve performance for long_index_t kernels

## Technical Details

Support magic division for long_index_t
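
For readers unfamiliar with the technique, magic division replaces an integer division by a precomputed multiply, add and shift. The following is a generic 64-bit sketch of that idea, not the code added by this PR: `Magic64`, `make_magic64` and `magic_div64` are hypothetical names, the dividend and divisor are assumed to be below 2^63, and `unsigned __int128` is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed constants for dividing by a fixed divisor.
struct Magic64
{
    std::uint64_t multiplier;
    std::uint32_t shift;
};

// Computes shift = ceil(log2(d)) and
// multiplier = floor(2^64 * (2^shift - d) / d) + 1, for 1 <= d < 2^63.
Magic64 make_magic64(std::uint64_t d)
{
    assert(d >= 1 && d < (std::uint64_t{1} << 63));
    std::uint32_t shift = 0;
    while((std::uint64_t{1} << shift) < d)
    {
        ++shift;
    }
    const unsigned __int128 one = 1;
    const unsigned __int128 m   = ((((one << shift) - d) << 64) / d) + 1;
    return {static_cast<std::uint64_t>(m), shift};
}

// Returns n / d using the constants for d: the high 64 bits of
// n * multiplier, plus n, shifted right. Requires n < 2^63 so the sum fits.
std::uint64_t magic_div64(std::uint64_t n, const Magic64& mg)
{
    assert(n < (std::uint64_t{1} << 63));
    const unsigned __int128 product = static_cast<unsigned __int128>(n) * mg.multiplier;
    const std::uint64_t hi          = static_cast<std::uint64_t>(product >> 64);
    return (hi + n) >> mg.shift;
}
```

The constants are computed once per divisor (typically on the host), and each later division costs one wide multiply, one add and one shift instead of a hardware divide.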

## Test Plan

test_grouped_convnd*

## Test Result

Passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-1386
2026-06-09 20:57:20 +00:00
Aviral Goel
93e0d79103 [rocm-libraries] ROCm/rocm-libraries#8035 (commit 45186b8)
[CK_Tile] Add wmma_bf16f32_16x16x32_bf16 warp-gemm test
 (#8035)

## Summary

Adds the warp-gemm unit test for `wmma_bf16f32_16x16x32_bf16`. **Stacked
on #8028** (the API change) and based on its branch, so #8028 shows the
isolated API diff and this PR shows just the test.

## Test

gfx125-guarded `WmmaBf16f32.ResidualPrecisionContrast`: computes `Y_bf16
= X_bf16·W_bf16 + R_fp32` via `WarpGemm::mac_downconvert`, compares
against an fp32 reference (within bf16 tolerance), and asserts it is at
least as accurate as the bf16-accumulate path — i.e. it demonstrates the
precision benefit of the fp32 accumulator (`C`) carried into the fused
bf16 down-convert.
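
The precision contrast can be emulated on the host. This is an illustrative sketch only: it does not call `WarpGemm::mac_downconvert`, it assumes finite inputs and round-to-nearest-even bf16 conversion, and `round_to_bf16`, `dot_fp32_acc` and `dot_bf16_acc` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Rounds a finite float to the nearest bf16 value (ties to even) and
// returns it widened back to float.
float round_to_bf16(float v)
{
    std::uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u); // round to nearest, ties to even
    bits &= 0xFFFF0000u;                   // keep the upper 16 bits
    std::memcpy(&v, &bits, sizeof(v));
    return v;
}

// y = x.w + r with an fp32 accumulator and one down-convert at the end.
float dot_fp32_acc(const std::vector<float>& x, const std::vector<float>& w, float r)
{
    assert(x.size() == w.size());
    float acc = r;
    for(std::size_t i = 0; i < x.size(); ++i)
    {
        acc += x[i] * w[i];
    }
    return round_to_bf16(acc);
}

// Same computation, but the accumulator is rounded to bf16 after every step.
float dot_bf16_acc(const std::vector<float>& x, const std::vector<float>& w, float r)
{
    assert(x.size() == w.size());
    float acc = round_to_bf16(r);
    for(std::size_t i = 0; i < x.size(); ++i)
    {
        acc = round_to_bf16(acc + round_to_bf16(x[i] * w[i]));
    }
    return acc;
}
```

With 1024 products of `1.0f * 0.0078125f` and a zero residual, the fp32 accumulator reaches the exact sum of 8, whereas the bf16 accumulator stops growing once the increment is no larger than half the bf16 spacing at its current magnitude.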

Passes on gfx1250.
2026-06-09 15:52:45 +00:00
chris-tsiaousis-hpc
dc3c1cffd5 [rocm-libraries] ROCm/rocm-libraries#7891 (commit 4dee41d)
Porting existing FMHA infra from users/shumway/ck/exp-kpack
 to develop (#7891)

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
2026-06-09 14:00:32 +00:00
Brock Hargreaves
65c395984d [rocm-libraries] ROCm/rocm-libraries#8108 (commit c620f0a)
[ck] Unify Build_CK and buildHipClangJob into buildAndTest
 (#8108)

## Motivation

`projects/composablekernel/vars/ck.groovy` had two near-identical build
functions, `buildHipClangJob` (lean: static checks, FMHA, tile-engine,
conv) and `Build_CK` (main per-arch matrix). This removes the
duplication and fixes a latent GitHub-status bug that lived in both.

## Technical Details

- Merged both into one `buildAndTest(Map conf)` gated by an explicit
`is_main_build` flag (default `false` = lean path; `true` adds the GPU
check + arch-gated inductor/perf/hipTensor; only `runBuildCKAndTests`
sets it).
- Deleted the `Build_CK_and_Reboot` / `buildHipClangJobAndReboot`
wrappers (they only logged and re-threw); all 13 call sites now call
`buildAndTest` directly.
- Widened the shared `catch` to `Exception` so build / image-pull / "GPU
not found" failures report **failure** instead of leaving the check
stuck **pending** (failing stages now go red).
- Removed the dead `no_reboot` key. No change to what is built or
tested.

## Test Plan

- Jenkins linter on the `Jenkinsfile`.
- One branch run covering both paths (per-arch matrix + lean stages),
spot-checking gfx1250 and a nogpu stage.

## Test Result

- Verified statically: no `buildHipClangJob*` / `Build_CK*` references
remain; `buildAndTest` defined once, all call sites wired.
- Pending: linter + branch run before merge.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-08 23:45:42 +00:00
Emily Martins
97ca00e449 [rocm-libraries] ROCm/rocm-libraries#7836 (commit cdd9958)
[CK Tile] Stream-K RDNA Support

## Motivation

Currently, CK Tile Stream-K only supports CDNA architectures. This
change adds Stream-K support on RDNA3/3.5 and RDNA4 architectures.

## Technical Details
Stream-K currently has 3 reduction strategies: 1) atomics, 2) linear,
and 3) tree. The linear and tree reductions require inter-workgroup
communication to a global flags buffer and a global partials buffer. To
ensure cache coherency, we use cache modifiers to skip cache levels that
are not visible to all workgroups. On CDNA architectures, scalar load
and scalar store instructions are available, which we use to read and
write to the flags buffer with appropriate cache skipping modifiers.
However, RDNA architectures do not support scalar store instructions, so
workgroups must use a buffer store instruction to write to flags.
Additionally, cache modifiers differ between CDNA and RDNA; they also
differ between RDNA3 and RDNA4. Given this information, the main changes
are as follows:
- Added RDNA flag signaling: Use buffer store instructions for writing
to global flags buffer
- Added appropriate cache modifiers for reading and writing to flags and
partials:
   - RDNA3 (gfx11): Use `glc | dlc` coherence flags
   - RDNA4 (gfx12): Use `DEVICE` coherence scope
- SFINAE-guarded overloads: Added compile-time dispatch for
`SignalStorePartialDone()` and `WaitStorePartialDone()` based on target
architecture
- RDNA alignment requirements: Increased flags buffer alignment from
128B to 256B due to RDNA cache line size
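
The compile-time dispatch in the list above can be sketched with `std::enable_if`. This is a simplified host-side illustration, not the kernel code: the architecture tags, the `flag_store_kind` enum and `signal_store_kind` are hypothetical, and the real `SignalStorePartialDone()` issues store instructions rather than returning a label.

```cpp
#include <type_traits>

// Hypothetical architecture tags.
struct cdna_tag {};
struct rdna3_tag {};
struct rdna4_tag {};

// Hypothetical labels for how the flags buffer is written.
enum class flag_store_kind
{
    scalar_store,             // CDNA: scalar store with cache-skipping modifiers
    buffer_store_glc_dlc,     // RDNA3 (gfx11): buffer store with glc | dlc
    buffer_store_device_scope // RDNA4 (gfx12): buffer store with DEVICE scope
};

template <typename Arch>
struct is_rdna : std::false_type {};
template <>
struct is_rdna<rdna3_tag> : std::true_type {};
template <>
struct is_rdna<rdna4_tag> : std::true_type {};

// Selected only for non-RDNA targets.
template <typename Arch, std::enable_if_t<!is_rdna<Arch>::value, int> = 0>
constexpr flag_store_kind signal_store_kind()
{
    return flag_store_kind::scalar_store;
}

// Selected only for RDNA3.
template <typename Arch, std::enable_if_t<std::is_same<Arch, rdna3_tag>::value, int> = 0>
constexpr flag_store_kind signal_store_kind()
{
    return flag_store_kind::buffer_store_glc_dlc;
}

// Selected only for RDNA4.
template <typename Arch, std::enable_if_t<std::is_same<Arch, rdna4_tag>::value, int> = 0>
constexpr flag_store_kind signal_store_kind()
{
    return flag_store_kind::buffer_store_device_scope;
}
```

For any given tag, exactly one overload survives substitution, so the choice is made at compile time with no runtime branch.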

**A note about the `amd_buffer_coherence_enum`:**
- **Problem:** The `amd_buffer_coherence_enum` uses preprocessor
conditionals (`#if defined(__gfx12__)`) to define architecture-specific
values. Template specializations reference enum values from different
architectures (e.g., `glc_dlc` for GFX11). Due to C++ two-phase name
lookup, non-dependent names are resolved during template parsing
regardless of which architecture is being compiled, causing compilation
failures when referenced values do not exist in the active preprocessor
branch.
- **Temporary Solution**: Added compatibility enum values to each
architecture block. For example, I added `glc_dlc` in the `__gfx12__`
block. I will create a ticket to refactor this enum with a design that
has better scalability and tries to avoid the use of preprocessor
conditionals.
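
The workaround can be illustrated with a reduced example. This is a sketch with made-up names and enumerator values (`coherence_sketch`, `gfx11_tag`, `gfx12_tag`, `flag_coherence`); it only shows why every enumerator that a specialization names must exist in every preprocessor branch.

```cpp
// Hypothetical stand-in for an enum whose members depend on the target.
enum class coherence_sketch : int
{
    coherence_default = 0,
#if defined(__gfx12__)
    device  = 1, // native value in this branch of the sketch
    glc_dlc = 2, // compatibility value, present only so the name resolves
#else
    glc_dlc = 1, // native value in this branch of the sketch
    device  = 2, // compatibility value, present only so the name resolves
#endif
};

struct gfx11_tag {};
struct gfx12_tag {};

template <typename Arch>
struct flag_coherence;

// Both specializations are parsed on every target, so both enumerators
// must be declared regardless of which preprocessor branch is active.
template <>
struct flag_coherence<gfx11_tag>
{
    static constexpr coherence_sketch value = coherence_sketch::glc_dlc;
};

template <>
struct flag_coherence<gfx12_tag>
{
    static constexpr coherence_sketch value = coherence_sketch::device;
};
```

Removing either compatibility enumerator makes the corresponding specialization fail to compile on the other target, even if that specialization is never instantiated there.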

## Test Plan
### Summary
gtests were added to test WMMA variants of Stream-K. These tests were
stress tested locally on gfx11 and gfx12.
### More details
This PR makes the following changes/additions to the Stream-K gtests:
- Split tests into MFMA (CDNA) and WMMA (RDNA) variants
- Added 16 WMMA kernel types: FP16/BF16/FP8/BF8 × Linear/Tree reduction
- WMMA uses 16×16×16 wave tiles for RDNA (this is the only tile size
supported on RDNA)
- Fixed RDNA WGP mode: multiply multiProcessorCount by 2 for actual CU
count
  - As described in the [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/doxygen/html/group___global_defs.html#ggacc0acd7b9bda126c6bb3dfd6e2796d7ca3ac50041beb59111a5c76edf03da0898),
when in Workgroup Processor (WGP) mode, the value of
`hipDeviceAttributeMultiprocessorCount` is half the number of CUs, because
a single WGP contains two CUs. The default mode on RDNA is WGP mode, so when
creating (M, N, K) instances for gtests using the CU count, we need to
multiply the reported multiprocessor count by 2 to get the actual CU count.
This is not needed in the kernel host code, because the occupancy ensures
that the overall `max_active_wgs` is correct.
## Test Result

All tests pass locally.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-08 22:48:10 +00:00