289 Commits

Author SHA1 Message Date
Kiefer van Teutem
2089713f94 [rocm-libraries] ROCm/rocm-libraries#8227 (commit 75c30d5)
=?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?=
 =?UTF-8?q?Remove=20unification=20Flag=20structs=20in=20favor=20of=20new?=
 =?UTF-8?q?=20WarpGemmParams=20(#8227)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Recently, the way flags are sent down to the intrinsics was changed in
CK Tile. At the point where the WarpGemm is invoked, an arbitrary number
of template parameters can be passed, and these are passed down all the
way to the lowest level intrinsics wrappers. Here
`WarpGemmParamsParser<>` is used to extract flags for the intrinsics.

In this MR we adapt the the unification framework (amdgcn_mma struct and
MmaPipelines) to work in the same way. By doing this, there is no longer
a point in our custom intrinsic Flag structs, so these are removed.

Unrelated but I also tried removing the MmaPipeline flags because they
arn't used for anything except CTranspose, which is already available.
This also make test_amdgcn_mma_pipeline completely redundant so removed
that as well.
2026-06-26 12:00:58 +00:00
Kiefer van Teutem
137f2a9a10 [rocm-libraries] ROCm/rocm-libraries#7407 (commit 0b79e05)
[CK TILE] Initial integration of MFMA / WMMA unification
 framework into CK Tile (#7407) (locked behind flag)

Note: Everything works but this is still a draft MR because I want to do
some more cleanup and maybe do some testing for MX fp6. Also please
don't trigger copilot, I will do this once I feel it is clean enough,
otherwise I'll get a bunch of comments about stuff I already know.

## Motivation
The point of this MR is to finally use our unification MmaPipelines to
replace the existing WarpGemms in CK tile and make sure everything
works. I focused on gfx908 and gfx950 for now, dense and scale
intrinsics, fp16, fp8, and fp4. I managed to get CK tests / examples
working for all of these scenarios, so the basic implementation should
be correct. I expect some more tweaks will be required to get full
support, some of which I already anticipated in the section "New
issues".

## Big switch: USE_NEW_UNIFIED_FRAMEWORK
When USE_NEW_UNIFIED_FRAMEWORK is 1, we replace all WarpGemms with
MmaPipelines from the new unified framework. This means
WarpGemmDispatcher will use the UnificationDispatcher instead of the
regular Dispatcher. Furthermore, named WarpGemms like
WarpGemmMfmaF32F32F32M16N16K4 will also get rerouted to the
UnificationDispatcher. The latter is necessary because some pipelines
bypass the WarpGemmDispatcher in favor of directly using named
WarpGemms.

For now the switch is turned on for easier testing, so don't expect the
CI to pass. When off, this MR should not affect any of the CK tile tests
at all so I *would* expect the CI to pass.

## Simplification of MmaPipelineBase
I found that the structure of MmaPipelineBase was a bit complex and I
was able to reduce it a lot. The only thing an Mma Pipeline does
(currently) is provide a wrapper around amdgcn structs that allows k
iteration and sparse compression. We don't allow M and N composition for
now for simplicity and since this is not expected from WarpGemms in CK
Tile currently.

## Re-interpretation of tile distribution encodings for packed datatypes
Tile distributions for packed types are expected to describe
mathematical elements, not datatype elements! This distinction is why
the gfx950 fp4 CK_tile tests were not working. Updated the
interpretation in amdgcn_mma, tile distribution calculator, and layout
test, along with comments. Tested on all architectures.

## getCMakeCompilerTarget() for configuration time target architecture
This is a workaround because there are a lot of cases in CK Tile where
the host code inspects Device constructions like WarpGemm, and we need
to get the version that *will* be used on the device. This is a big
kludge and we need to figure out a better solution. Also this util will
always pick the *first* cmakelists target arch, so there will be issues
when compiling for multiple target architectures. Ideally, the host code
should not touch the WarpGemms at all, and there would be no issue. This
has been a point of friction in CK for a long time. We can discuss this
with Chris Millette.

## Tests
I was able to verify that the following CK Tile tests and examples work
with the new unified framework:
tile_tutorial_mfma_16x16x16 (gfx9, fp16, uses transpose)
tile_example_gemm_basic (gfx9, fp16)
test_ck_tile_mx_gemm_async (gfx950, microscaling fp8 and fp4)

Within the tile tutorial I was also able to use
WarpGemmMfmaF16F16F32M16N16K32TransposedCDistribution instead of
WarpGemmMfmaF16F16F32M16N16K16TransposedCDistribution to verify that
basic K iteration also works.

A little while ago I also verified that the performance did not change
in a measurable way, and the compile did not change *much* but did see
some swings up to 20% each way (faster or slower). We will need some
broader and more accurate tests for this going forward.

## Moving forward
To confidently be able to replace the existing Dispatcher and WarpGemm
framework with our own, we need to make sure that all existing tests and
examples work on all platforms. Furthermore, we should pay attention to
performance and compile time of all these tests. Performance should
definitely not change, as all we're doing is refactoring the support
structure around the intrinsics, which should melt away during
compilation.

## New issues
(I will make new issues with descriptions for these but here is a short
list (incomplete):

Test RDNA CK Tile pipelines
Test Sparse Ck Tile pipeline (does not exist but we can make one)
Remove MmaOp flags from unification framework and update it to work with
new WarpGemmParamsParser instead.
Add Swizzle support and test in CK Tile pipelines.
Test Scale + transpose Ck Tile pipelines.
Coherent strategy for attrnumaccess for dense, scale, default, packed,
wmma, gfx1250, etc in CK tile. It's messy now.
Dispatcher should not be determining scale-ness of intrinsics based on
MNK sizes.
Try adding back the MN composition in MmaPipelines
Why is test_amdgcn_wavewise_mma only compiled for CDNA?
Investigate NOP and AGPR flags
Maybe get rid of WmmaTag in dispatcher.
Find a coherent strategy for dealing with host vs device compile passes,
and the host sneaking a peak at WarpGemm internals. Related to
getCMakeCompilerTarget().

## TODO before merge
Some changes exist just for ease of testing, and will be reverted before
merging:
- gemm_basic.cpp has a lot of datatypes disabled because otherwise
compile time is huge for testing
- USE_NEW_UNIFIED_FRAMEWORK is set to 1 for easier testing
2026-06-24 13:35:25 +00:00
Enrico Degregori
55e30feac6 [rocm-libraries] ROCm/rocm-libraries#8637 (commit a1a7f5f)
[CK] Fix compilation

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-20 02:08:58 +00:00
Enrico Degregori
2733e75900 [rocm-libraries] ROCm/rocm-libraries#6565 (commit d41715e)
[CK Tile] Async support pipeline V3

## Motivation

Optimize pipeline V3 for gfx950 by enabling buffer load to lds (async
pipeline)

## Technical Details

- Add `Async` bool to `Problem` struct to enable async pipeline in
existing one
- Add `static_move_ys` to load transpose. This generates offset in
assembly instructions saving registers
- Add `is_valid` to `async_get_vectorized_elements`. Before hard coded
to true. It allows to support padding
- Remove unnecessary restrictions to `is_a_load_tr` and `is_b_load_tr`
(wider use of lds load transpose on gfx950)
- Integrate async support in existing V3 pipeline (avoid pipelines
duplication)
- Create policy to support both async and default cases. This could be
used by any async pipeline (next steps)
- Define `wg_attr_num_access` separately for A and B. This allows to
optimize ds_read instruction width for cases when one matrix is
transposed and the other is not. Before in such cases, `ds_read_b64` was
used instead of `ds_read_b128`
- Add test for V3 async. Currently only supporting cases with A and B
having the same type

## Test Plan

New test `test_ck_tile_gemm_pipeline_compv3_async`

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-19 06:57:14 +00:00
Sami Remes
a3a12b8945 [rocm-libraries] ROCm/rocm-libraries#5813 (commit 18b43cf)
[CK_TILE] Enable full transpose layout support for MX GEMM
 pipeline (#5813)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Enable full transpose layout support for MX GEMM pipeline (32x32x64
MFMA)

### Summary

This PR enables all four matrix layout combinations (Row/Col, Row/Row,
Col/Col, Col/Row) for the MX GEMM pipeline with `32x32x64` MFMA warp
tiles, using `ds_read_tr` transposed LDS loads on gfx950. Previously,
only the canonical `A=RowMajor, B=ColumnMajor` layout was supported.

### Changes

**Kernel-side transpose support:**

- **`warp_gemm_attribute_mfma.hpp`**: Introduce `kSplitFactor` logic in
`get_warp_dstr_encoding` to split the K-dimension distribution encoding
when `kPerLane` exceeds the `ds_read_tr` subtile minor dimension. This
satisfies the `TransposeTileDistributionTraits` suffix validation
required by `load_tile_transpose`. The distribution encoding now also
receives the `DataType` template parameter to compute the split factor
based on packed element size.

- **`gemm_pipeline_ag_bg_cr_comp_async.hpp`**: Uncomment and enable the
`InputTileDistributionTraits` logic to properly transform LDS load tile
distributions for transposed reads. Add `static_assert`s to catch
misconfigurations where a layout requires transpose loads but the warp
tile size disables them (e.g. `KWarpTile=128` exceeds `ds_read_tr`
limits).

- **`load_tile_transpose.hpp`**: Fix `DataVec` sizing for packed types
(`pk_fp4_t`) — divide `vecLoadSize` by `PackedSize` to prevent buffer
overflow when each physical element contains multiple logical values.

- **`warp_gemm_attribute_mfma_impl.hpp`**: Set `kDefaultScale` to
`0x7F7F7F7F` (unity in e8m0 format) for the unscaled `operator()`
overloads of `WarpGemmAttributeMfmaImpl_f32_32x32x64_f8f6f4`, ensuring
correct behavior with `mfma_scale_f32_32x32x64_f8f6f4`.

- **`warp_gemm.hpp` / `warp_gemm_dispatcher.hpp`**: Add generic
`WarpGemmMfma_f32_32x32x64_f8f6f4<A, B>` alias and dispatcher
specialization to support arbitrary MX data type combinations (fp4, fp6,
fp8) with the 32x32x64 MFMA, consolidating the existing type-specific
aliases.

- **`gemm_pipeline_ag_bg_cr_comp_async_default_policy.hpp`**: Simplify
`wg_attr_num_access` determination — `Double` for fp8, `Single`
otherwise.

**Reference implementation fix:**

- **`reference_gemm.hpp`**: Fix nibble selection for packed 4-bit types
(`pk_fp4_t`, `pk_int4_t`) in `reference_mx_gemm`, `reference_gemm`, and
`reference_gemm_abquant`. The previous logic used `k % 2` or
`index[K_DIM] & 1` to select which nibble to extract, which assumed K
was always the fast (contiguous) memory dimension. This is only true for
`A=RowMajor` / `B=ColumnMajor`. For other layouts, the fix computes the
flat memory offset via `mDesc.GetOffsetFromMultiIndex(...)` and uses its
parity to correctly select the nibble regardless of layout.

**Test infrastructure:**

- **`test_mx_gemm_config.hpp`**: Add `MxGemmConfig32` base and
`MXfp4_GemmConfig32` / `MXfp8_GemmConfig32` configs for the 32x32x64
warp tile.
- **`test_mx_gemm_fp4.cpp` / `test_mx_gemm_fp8.cpp`**: Add `Config32`
test suites covering all four layout combinations. Restrict `Config16`
(16x16x128) to `A=Row, B=Col` only, since `KWarpTile=128` exceeds
`ds_read_tr` limits.
- **`test_mx_gemm_util.hpp`**: Fix scale tensor layout — scales are
always row-major `[M, K/32]` and column-major `[K/32, N]`, independent
of A/B data layout.

### Test plan

- [x] `test_ck_tile_mx_gemm_fp4` — 5/5 passed (16x16x128 Row/Col +
32x32x64 all 4 layouts)
- [x] `test_ck_tile_mx_gemm_fp8` — 5/5 passed (16x16x128 Row/Col +
32x32x64 all 4 layouts)
- [x] `test_ck_tile_mx_gemm_fp6` — 1/1 passed (16x16x128 Row/Col)
2026-06-18 17:05:09 +00:00
Enrico Degregori
1762eaeaec [rocm-libraries] ROCm/rocm-libraries#8535 (commit a0f47eb)
[CK Tile] EightWaves pipeline int8 support

## Motivation

EightWaves pipeline currently is supporting only FP types

## Technical Details

 - Enable 16x16x64 int8 instruction for gfx950 in dispatcher
 - Enable int8 in EightWaves pipeline
 - Add tests
 - Fix bug in `warp_gemm_attribute_mfma_impl.hpp`

## Test Plan

Tests have been added for int8 GEMM using EightWaves pipeline

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 12:59:59 +00:00
Ville Pietilä
60b276647b [rocm-libraries] ROCm/rocm-libraries#8157 (commit b0d9d39)
[CK Tile] Rule-based configuration generation in CK
 Dispatcher codegen (#8157)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

The CK Tile Dispatcher code generation for CK Tile Profiler relies on
flat JSON files to list the generated configurations. This approach has
the following problems

- The JSON files are verbose
- The JSON files get easily out of sync with the CK Builder .config
files from which they were generated from.
- The JSON file based configuration make it hard to list explicitly the
rules that govern the instance generation.

## Technical Details

Replaced the JSON files with a rule based configuration. To preserve the
existing functionality, the `profiler` and the `tests` instance sets are
generated directly from the CK Builder config files. The JSON config
files are removed from source control, and the "on-the-fly" generation
guarantees that the Dispatcher codegen uses up to date configurations.

This is PR introduces six different rule sets for the CK Tile Dispatcher
code generation

1. `profiler`: matches with the old JSON set of profiler configurations.
2. `tests`: matches with the old JSON set of tests configurations.
3. `full`: full configuration set created from a rule-based config
selection
4. `full-tests`: a subset of `full` for generating configurations for
convolution integration tests.
5. `tiny`: a subset of `full-tests` to produce the minimal set of
configurations to test the Dispatcher codegen.
6. `default`: the default rules, which corresponds to the existing
heuristic rules for configuration selection. This ensures that ML based
kernel selection doesn't get broken.

The main use of the `full` rule set is to define a reasonable solution
space for the possible implicit GEMM configurations. We start from the
configurations that allowed by the device architecture. The `full` rule
set defines the relevant tile sizes for each convolution direction. From
the tile size we have a curated mapping to the number of waves over the
different GEMM axes, i.e., we describe how many waves each GEMM
dimensions corresponds to. The GEMM-K wave tile dimension can be
computed from the other parameters and does not need to be listed
explicitly.

An orthogonal axis to the tiling strategy is the vectorization strategy.
This mainly defined by the data type and hardware as in general, we want
to use the maximum possible load widths. The maximum sizes for each
convolution direction variant are defined by the implicit GEMM matrix
dimensions. For cases where have a low number of channels per
convolution group, we need smaller vector load sizes. These are captured
by the `VecStrategy` enumeration in the codegen rules.

The problem with the rule based configuration selection is that we "over
generate" configurations. The old JSON configurations compose
approximately 25% of all configuration that the `full` rule set creates.
The additional configurations are valid, but they many not provide any
performance benefits. Hence, we keep the `profiler` and `tests` rule set
for now to avoid building an excessive amount configurations by default.
The `full` rule set can be taken into use by specifying CMake
configuration flag `-D DISPATCHER_RULE_SET=full`. By default, the
`tests` rule set is used, i.e., we don't change the existing bahaviour.

## Test Plan

Added a new stage in the CI/CD pipeline that ensures the Dispatcher
codegen rules are up to date. Otherwise the functionality is covered by
the existing CI/CD tests. There are no functional changes to the
convolution kernels. Only how the different instances are generated.

## Test Result

If the CK Tile conv instances build without errors, the Dispatcher
codegen is generating valid code. If all tests in CI/CD pipeline are
passing, the Dispatcher codegen generates valid instances.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 01:22:50 +00:00
Aviral Goel
c43b550206 [rocm-libraries] ROCm/rocm-libraries#8202 (commit 0911fa0)
[GFX1250][CK_TILE] Add scale16 (ScaleBlockSize=16) support to
 MX GEMM TDM pipeline (#8202)

Enables `ScaleBlockSize=16` end-to-end for the FP8/BF8 MX GEMM TDM
pipeline, building on the scale16 warp-gemm layer already in develop.

- **warp gemm:** add the 32x32x128 f8f6f4 scale16 traits and alias (2x2
grid of 16x16x128 scale16 intrinsic calls with per-subtile
`SCALE_OPSEL`), and route 32x32 f8f6f4 through the dispatcher's
`IsScale16` path.
- **default policy:** select the warp gemm via the dispatcher with
`IsScale16=(ScaleBlockSize==16)` so `WarpTile=16` and `WarpTile=32` each
pick the matching scale16 path; guard WarpTile M/N to 16 or 32;
scale-tile distribution for the scale16 layout.
- **pipeline V1/V2:** thread `Problem::ScaleBlockSize` through the
scale-window setup (replacing the hardcoded 32); expose `ScaleBlockSize`
for the kernel.
- **block gemm:** extract int64 (scale16) / int32 (scale32) scales by
width.
- **kernel:** scale16 descriptor order; reject unsupported
`BlockScaleSize`.

Test coverage for this path is in the stacked follow-up PR.
2026-06-17 16:41:00 +00:00
jefyang1
276863ca87 [rocm-libraries] ROCm/rocm-libraries#8259 (commit df03f10)
Add cluster launch in test ck_tile mx gemm tdm wmma

## Motivation

Add cluster launch test in test_ck_tile_mx_gemm_pipeline_tdm_wmma on
gfx1250, so that we can check the performance on gfx1250 hardware.

## Technical Details

Added Out-of-bounds guard in RunGemm of MxGemmKernel to skip blocks
padded by cluster alignment.

Add ClusterEnable/ClusterDisable aliases and extend the tuple in
test_mx_gemm_pipeline_kernel_types.hpp by adding two kernel types with
ClusterEnable for F8 CompTDMV1 and CompTDMV2 respectively. The existing
F4 non-ClusterLaunch kernel types have issue to be fixed, so this PR
does not include F4 cases.

Read ClusterLaunch from the tuple in test_mx_gemm_pipeline_util.hpp.

Update invoke_mx_gemm to branch on ClusterLaunch, including Add cluster
size constants, Switch GemmShape type, TilePartitioner type, and the
kernel launch call.

## Test Plan

Tested the changes on gfx1250 FFM.

## Test Result

The added kernel types (instances) passed the tests on gfx1250 FFM.

## Submission Checklist

- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-11 17:33:11 +00:00
Emily Martins
97ca00e449 [rocm-libraries] ROCm/rocm-libraries#7836 (commit cdd9958)
[CK Tile] Stream-K RDNA Support
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Currently, CK Tile Stream-K only supports CDNA architectures. This
change adds Stream-K support on RDNA3/3.5 and RDNA4 architectures.

## Technical Details
Stream-K currently has 3 reduction strategies: 1) atomics, 2) linear,
and 3) tree. The linear and tree reductions require inter-workgroup
communication to a global flags buffer and a global partials buffer. To
ensure cache coherency, we use cache modifiers to skip cache levels that
are not visible to all workgroups. On CDNA architectures, scalar load
and scalar store instructions are available, which we use to read and
write to the flags buffer with appropriate cache skipping modifiers.
However, RDNA architectures do not support scalar store instructions, so
workgroups must use a buffer store instruction to write to flags.
Additionally, cache modifiers differ between CDNA and RDNA; they also
differ between RDNA3 and RDNA4. Given this information, the main changes
are as follows:
- Added RDNA flag signaling: Use buffer store instructions for writing
to global flags buffer
- Add appropriate cache modifiers for reading and writing to flags and
partials:
   - RDNA3 (gfx11): Use `glc | dlc` coherence flags
   - RDNA4 (gfx12): Use `DEVICE` coherence scope
- SFINAE-guarded overloads: Added compile-time dispatch for
`SignalStorePartialDone()` and `WaitStorePartialDone()` based on target
architecture
- RDNA alignment requirements: Increased flags buffer alignment from
128B to 256B due to RDNA cache line size

**A note about the `amd_buffer_coherence_enum`:**
- **Problem:** The `amd_buffer_coherence_enum` uses preprocessor
conditionals (`#if defined(__gfx12__)`) to define architecture-specific
values. Template specializations reference enum values from different
architectures (e.g., `glc_dlc` for GFX11). Due to C++ two-phase name
lookup, non-dependent names are resolved during template parsing
regardless of which architecture is being compiled, causing compilation
failures when referenced values do not exist in the active preprocessor
branch.
- **Temporary Solution**: Added compatibility enum values to each
architecture block. For example, I added `glc_dlc` in the `__gfx12__`
block. I will create a ticket to refactor this enum with a design that
has better scalability and tries to avoid the use of preprocessor
conditionals.

## Test Plan
### Summary
gtests were added to test wmma variants of Stream-K. These tests were
stressed tested locally on gfx11 and gfx12.
### More details
This PR makes the following changes/additions to the Stream-K gtests:
- Split tests into MFMA (CDNA) and WMMA (RDNA) variants
- Added 16 WMMA kernel types: FP16/BF16/FP8/BF8 × Linear/Tree reduction
- WMMA uses 16×16×16 wave tiles for RDNA (this is the only tile size
supported on RDNA)
- Fixed RDNA WGP mode: multiply multiProcessorCount by 2 for actual CU
count
- As described in [HIP
documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/doxygen/html/group___global_defs.html#ggacc0acd7b9bda126c6bb3dfd6e2796d7ca3ac50041beb59111a5c76edf03da0898),
when in Workgroup Processor (WGP) mode, the value of
`hipDeviceAttributeMultiprocessorCount` is half of CUs, because a single
WGP contains two CUs. The default mode on RDNA is WGP mode, so when
creating (M, N, K) instances for gtests using the CU count, we need to
multiply the CU count by 2 to get the correct value. This is not needed
in the kernel host code, because the occupancy ensures that overall
`max_active_wgs` is correct.
## Test Result

All tests pass locally.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-08 22:48:10 +00:00
Bartłomiej Kocot
2c363870d9 [rocm-libraries] ROCm/rocm-libraries#6744 (commit 9d056e8)
[Ck][CK Tile] Global Load/Store for Large Tensors support
 (#6744)

## Motivation

Create solution to support large tensors in the entire ck tile.

## Technical Details

- add possiblity to use global load
- int64 indexing

## Test Plan

conv fwd tests

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-913
2026-06-06 10:14:17 +00:00
Enrico Degregori
1b4fbd95fd [rocm-libraries] ROCm/rocm-libraries#6089 (commit c876d18)
[CK Tile] Extend type support EightWave pipeline

## Motivation

EightWave pipeline was designed for 8 bit types. This PR extend support
for any FP type

## Technical Details

 - Generalize policy to support any FP type
- Change LDS layout to fix bank conflicts. This removes all bank
conflicts in the pipeline (checked for all supported types). Remaining
bank conflicts are related to Cshuffle epilogue.

## Test Plan

Added GEMM tests with new supported types. Note that FP6 is also
supported for MX GEMM but the PR was reverted so no tests were added for
it.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-05 23:54:40 +00:00
Yung-sheng Tu
e826b2eb7e [rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f)
=?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?=
 =?UTF-8?q?Add=20MFMA=20specialisations=20for=20`tf32=5Ft`=20(#6768)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

This PR adds two specialisations related to `tf32_t`.

## Technical Details

This change treats `tf32_t` as a concrete type rather than an empty
`struct`. It also adds two new specialisations for MFMA dense builtins
and resolves existing circular include issues.

## Test Plan

All the new wrappers were added to the test suite in
test_amdgcn_mma_layout.inc.

## Test Result

Test should pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-05 12:27:41 +00:00
Enrico Degregori
7b9245f18c [rocm-libraries] ROCm/rocm-libraries#5854 (commit 8e2d46d)
[CK Tile] Async support preshuffle GEMM

## Motivation

Add async support to existing preshuffle GEMM pipeline

## Technical Details

Notes:
the implementation avoids previous strategy of duplicating pipelines for
async support and instead add a switch `Async` to the ops Problem to
enable async pipeline. Then, integrate the async pipeline in the
existing one. This allows to avoid code duplication and facilitate the
integration of buffer load to lds in existing pipelines. In my opinion,
it should be used also for other pipelines which don't support buffer
load to lds yet and it would also be a good idea to refactor the
existing async GEMM pipelines with the same approach.

Summary:

 - integrate buffer load to lds in existing pipeline
- add optimal tensor descriptors for vmem loading and lds reading. They
are currently optimized for 16x16 wave tiles but they also work for
32x32 wave tiles. Optimizations for 32x32 wave tile requires different
lds layout and it will be done in a follow-up issue
 - Add async config to examples
 - Add test (gfx950 only)

## Test Plan

New test for gfx950 `test_ck_tile_gemm_pipeline_wp_async`

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-05 07:17:09 +00:00
Aviral Goel
267ca67001 [rocm-libraries] ROCm/rocm-libraries#8028 (commit c1cb112)
[CK_Tile] Add wmma_bf16f32_16x16x32_bf16 via
 fused-downconvert override (#8028)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Adds `__builtin_amdgcn_wmma_bf16f32_16x16x32_bf16` (fp32 accumulate →
bf16 output) to the CK Tile WMMA warp-gemm path. **API only** — the unit
test is split into a stacked PR (#8035) so this API change can be
reviewed in isolation.

## Changes (4 files)

- **16-bit trait:** `wmma_intrinsic_downconvert` (calls the bf16f32
builtin — fp32 C in, bf16 C out) plus `COutDataType = bf16_t` /
`COutVecType`.
- **`WarpGemmAttributeWmmaImpl` / `WarpGemmAttributeWmma`:**
`mac_downconvert(c_fp32, a, b)` (kTransC-aware) returning the bf16
C-output vector.
- **`WarpGemmImpl`:** `mac_downconvert` tail handler producing a bf16
C-output tile from the fp32 accumulator tile, reusing
`CWarpDstrEncoding` (output layout identical to the f32 C tile).

Verified on gfx1250 (via the stacked test PR #8035): the test passes;
the existing WMMA warp-gemm test is unaffected (additive change only).
2026-06-05 05:01:31 +00:00
John Afaganis
96c39b331e [rocm-libraries] ROCm/rocm-libraries#7829 (commit 13af7da)
[ck] Enforce ASCII-only C/C++ sources for hipRTC
 compatibility (#7829)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

CK source files must be compilable via **hipRTC (HIP runtime
compilation)**, whose preprocessor does not accept non-ASCII bytes
anywhere in a translation unit — **including in comments**. Bytes that
are harmless under `hipcc` (em-dashes, smart quotes, multiplication
signs, Greek letters, box-drawing glyphs, etc.) cause hipRTC to fail at
preprocessing time. These regularly leak in via LLM-assisted authoring
or copy/paste from formatted documents and silently break hipRTC paths
that are not exercised by the default `hipcc`-based build matrix.

This PR (a) cleans every existing violation (53 files) and (b) adds a
pre-checkin gate so new violations are rejected before merge.

## File extensions covered

Both the cleanup scan and the new Jenkins enforcement stage use the same
predicate:

```
*.h  *.hpp  *.cpp  *.h.in  *.hpp.in  *.cpp.in  *.inc  *.cl
```

(excluding `*/build/*` and `*/include/rapidjson/*`). This is a strict
superset of the existing `Clang Format` stage's predicate — `*.inc` is
added so test-fixture include files are also gated. The local pre-commit
hook's `c++/inc` type filter covers the same set.

## Why no enforcement today

CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so
the existing `pre-commit` workflow doesn't touch CK. The local CK
`.pre-commit-config.yaml` only runs for developers who installed hooks.
The **authoritative gate is therefore the new Jenkins stage** in this
PR; the local hook is convenience.

## Commit layout (bisect-friendly)

1. `79798aa6261` — **`[ck] Convert reflect/ rendering to ASCII for
hipRTC compatibility`**
Behavior change, isolated. `TreeFormatter` swaps `├─ / └─ / │ ` for `|-
/ +- / | ` (3-col width preserved so alignment is unchanged).
`conv_description.hpp` swaps `×` for `x` as the dimension separator.
`test_conv_description.cpp` expected strings updated in lockstep so the
snapshot test stays green. This is the only commit in the series with
observable runtime impact.

2. `738fdb0d81c` — **`[ck] Strip non-ASCII bytes from C++ sources for
hipRTC compatibility`**
Mechanical text cleanup across 53 files. Replacements happen in comments
or in `std::cout` strings that are not asserted on by any test. None of
the 174 `.inc` files in the tree required edits, but they were in the
scan's predicate so the enforcement stage's predicate is a superset of
what was scanned. Full replacement table in the commit message.

3. `1d7cd8ba235` — **`[ck] Enforce ASCII-only C/C++ sources for hipRTC
compatibility`**
- New `projects/composablekernel/script/check_ascii_only.sh` (modeled on
`check_copyright_year.sh`).
- New entry in `projects/composablekernel/.pre-commit-config.yaml` under
the local-hooks block (`types_or: [c++, inc]`).
- New `ASCII Only Check` parallel stage in
`projects/composablekernel/Jenkinsfile`'s `Static checks` block,
mirroring the existing `Clang Format` stage but with `*.inc` added to
the find predicate. Always-on, no `RUN_CPPCHECK` gate.

The tree is buildable at every commit boundary. Commit 1 leaves 50 known
violations; commit 2 leaves 0; commit 3 wires the gate.

## Demo

Script output on a synthesized violation:

```
$ printf '// em-dash test \xe2\x80\x94 here\n' > /tmp/bad.cpp
$ projects/composablekernel/script/check_ascii_only.sh /tmp/bad.cpp
ERROR: /tmp/bad.cpp contains non-ASCII bytes:
1:// em-dash test — here
  Fix: replace with ASCII (em-dash -> --, smart quotes -> ", arrows -> ->, etc.)
$ echo $?
1
```

Full repo scan after the cleanup commits (note the `-name '*.inc'`
clause):

```
$ cd projects/composablekernel && find . -type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \
    -o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \
    -not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \
  | xargs -0 -P 8 -n 64 script/check_ascii_only.sh
$ echo $?
0
```

## Test plan

- [ ] Jenkins PR build: confirm new `Static checks -> ASCII Only Check`
stage runs green over the full predicate (incl. `*.inc`) and existing
`Clang Format` stage is unaffected.
- [ ] `test_conv_description` passes against the ASCII tree-formatter
output (touched in commit 1).
- [ ] Local: `pre-commit run ascii-only-checker --all-files` runs
cleanly after installing CK pre-commit hooks via
`script/install_precommit.sh`.
- [ ] Manually inject a non-ASCII byte in any `.cpp/.hpp/.inc` file,
push: confirm Jenkins fails the new stage with a clear error.
- [ ] Spot-check a representative subset of touched files under hipRTC
compilation to confirm no remaining hipRTC-blocking content (optional,
since the static byte check is a sufficient condition for hipRTC
preprocessor acceptance on this dimension).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-06-04 15:00:17 +00:00
Copilot
4fcd73a98e [rocm-libraries] ROCm/rocm-libraries#7974 (commit 9df2c76)
composablekernel: remove stray *.hpp.bk backup artifacts
 (#7974)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Four `*.hpp.bk` files were accidentally committed to
`projects/composablekernel/`, likely as leftovers from a prior merge or
conflict resolution. Each is an older snapshot of its `.hpp` counterpart
— the canonical `.hpp` files are newer and contain the correct current
content.

## Deleted files

| File | vs. `.hpp` counterpart |
|---|---|
| `ck_tile/core/tensor/tile_window.hpp.bk` | Older version: uses legacy
`bool isL1Cache`/`PrefetchL1` template params; missing
`DataCachePrefetchKind`-based prefetch API and `data_cache_prefetch.hpp`
include |
| `ck_tile/core/tensor/load_tile_transpose.hpp.bk` | Older version:
missing `#if defined(__gfx950__)` guard and `Quad` struct (~90 lines)
for gfx1250 architecture |
| `ck_tile/ops/gemm/warp/warp_gemm_dispatcher.hpp.bk` | Older version:
missing `WmmaTag`, `IsScale16` template param, and several newer
dispatcher specializations |
|
`ck_tile/ops/gemm_quant/block/block_universal_gemm_as_bs_bquant_cr.hpp.bk`
| Older version: `KPackA`/`KPackB` (since renamed `KPack`); uses
`static_ford` (since refactored to nested `static_for`) |

## Verification

- No other `.bk` files exist in `projects/composablekernel/`.
- No build scripts, CMake files, includes, or documentation reference
these `.bk` files.
- No `.hpp` files were modified.
2026-06-04 03:06:43 +00:00
Aviral Goel
e01603bc31 [rocm-libraries] ROCm/rocm-libraries#7725 (commit eef7e12)
[GFX1250][CK_TILE] Add scale16 warp gemm unit tests

## Summary
- Add scale16 WMMA intrinsic overloads and int64_t forwarding to warp
gemm layers for gfx1250
- Add comprehensive wave-level unit tests for scale16 warp gemm
(16x16x128 and 32x32x128 tile sizes)
- Test all fp8/bf8 type combinations and TransposeC variants
- Fix WarpGemm wrapper for non-uniform scale16 configurations

Stacked on #7724 (FillUniformScaleDistribution / MX GEMM scale init).
Pipeline enablement follows in the next PR.
2026-06-03 22:05:29 +00:00
Aviral Goel
99ab4c4ef7 [rocm-libraries] ROCm/rocm-libraries#7830 (commit 590fe58)
[CK_Tile][MI450] Add bf16 output wmma instruction (16x16x32)
 (#7830)

Wire __builtin_amdgcn_wmma_bf16_16x16x32_bf16 into CK Tile for gfx1250,
enabling bf16-input bf16-output WMMA at the warp GEMM level.

- Add WmmaTraits specialization for <gfx125_t, bf16, bf16, bf16,
16,16,32>
- Add WarpGemmAttributeWmmaImpl typedef and WarpGemmWmma alias
- Add Dispatcher entry for bf16->bf16 16x16x32
- Add warp_gemm test with reference GEMM validation

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-02 13:54:16 +00:00
Johannes Graner
b7c8fb164f [rocm-libraries] ROCm/rocm-libraries#7937 (commit abe276d)
[CK Tile] Add conv Wavelet GEMM pipeline and bwd_weight
 instances (#7937)

## Motivation

CK Tile had no pipeline competitive with old CK's wavelet on the
RetinaNet K=36 C=256 3x3 conv bwd_weight class. This adds a
wave-specialized "wavelet" GEMM pipeline so CK Tile has a competitive
kernel for spatial small-K shapes.

## Technical Details

- New wavelet GEMM pipeline (`gemm_pipeline_ag_bg_cr_wavelet.hpp`):
workgroup split into math waves (LDS read + MFMA) and load waves (DRAM
read + LDS write).
- VGPR role-split: `operator()` has two top-level mutually-exclusive
`is_math` branches so the allocator overlays both roles onto the same
physical VGPRs, cutting arch VGPR ~33-40% and raising occupancy.
Correctness depends on identical `block_sync_lds` counts on both arms
plus a matching load-wave barrier stub in the epilogue
(`cshuffle_epilogue.hpp`).
- Kernel dispatch (`grouped_convolution_backward_weight_kernel.hpp`):
`kIsWavelet` path, `LaunchBlockSize`, load-wave barrier stub.

Uplift: wavelet is the fastest CK Tile pipeline on the RetinaNet K=36
C=256 3x3 family, beating the best non-wavelet CK Tile kernel by 10-27%
(googlenet K=320 by 16-23%); the role-split roughly halves the parity
gap vs old CK on the 13x13 fp16 shape.

## Test Plan

- `ckProfiler grouped_conv_bwd_weight`, NHWGC layout, fp16/bf16,
`split_k=all`, CPU verify on RetinaNet K=36 shapes (7x7, 13x13) and a
broad 2D sweep.
- Correctness: `-v=1` across `split_k` in {-1,1,2,4,8,16,32,64}
(barrier-parity / deadlock check).
- `test_grouped_convnd_bwd_weight` over the tests `.conf` wavelet
instances.

## Test Result

- All wavelet instances CPU-verify correct across the split-K sweep; no
hangs (dual-arm barrier sequence matches).
- Wavelet wins the RetinaNet K=36 C=256 3x3 family (10-27% over best
non-wavelet CK Tile) and googlenet K=320 (16-23%); at parity-or-better
vs old CK on the majority of spatial shapes.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-02 08:51:17 +00:00
Tianyuan Wu
22a99f97e8 [rocm-libraries] ROCm/rocm-libraries#7677 (commit 308af93)
[CK_Tile] Add scale16 Support for F4 WMMA in CK_Tile

## Motivation
This PR adds CK Tile support for the scale16 F4 WMMA path on gfx1250 and
improves warp GEMM unit test coverage/structure for gfx1250-specific
cases.

## Technical Details

- Scale16 support in warp GEMM dispatch and WMMA trait plumbing: added
IsScale16 plumbing to warp GEMM dispatcher path
- Warp GEMM test restructuring for gfx1250: added Warp GEMM gfx1250
coverage to verify all F4 WMMA paths

## Test Plan
Run ./test_ck_tile_wg_32x16x128_fp4.

## Test Result
```
./test_ck_tile_wg_32x16x128_fp4
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (1751 ms total)
[  PASSED  ] 3 tests.
```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-30 01:28:48 +00:00
Emily Martins
95c916369c [rocm-libraries] ROCm/rocm-libraries#7584 (commit 060bad5)
[CK_TILE] Fix Stream-K k_size calculation

## Motivation

In a recent benchmarking task for CK Tile Stream-K algorithm, we
identified that certain instances segfault. This change works to fix the
bug and adds necessary regression tests.

## Technical Details

The StreamK kernel constructs tensor views using a `k_size` parameter
that determines how much of the K dimension to process in each
iteration. Previously, this was calculated as:
 ```cpp
index_t k_size = num_loop_sk * TilePartitioner::KPerBlock;
```
This calculation assumes all macro tiles along K are exactly `KPerBlock` in size. However, when `K % KPerBlock != 0`, the final macro tile along K has a remainder size of `K % KPerBlock`, not a full `KPerBlock` (see the figure below):
<img width="961" height="488" alt="image" src="https://github.com/user-attachments/assets/3e1cceed-5dcd-4980-8b02-cee24eecf262" />
With the old code, a workgroup working with the `MPerBlock x (K % KPerBlock)` tile in A and B risk accessing illegal memory.

Hence, this change ensures that when `K % KPerBlock != 0`, workgroups processing iterations that include the final macro-tile along K calculate the correct `k_size` based on the remainder rather than assuming a full `KPerBlock`.

## Test Plan
I added the following tests:
1. Unit tests added for the Stream-K Tile Partitioner:
- `StreamKTilePartitionerBaseGetKSize/NoRemainderTiles` - validates full tiles
- `StreamKTilePartitionerBaseGetKSize/RemainderTiles` - validates remainder handling
2. Regression tests that test a case where `K % KPerBlock != 0`

## Test Result

Tests passed locally on gfx90a, gfx942, and gfx950.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-29 21:36:49 +00:00
Andriy Roshchenko
d5c9215064 [rocm-libraries] ROCm/rocm-libraries#7359 (commit dd62f9f)
[CK_TILE][GFX1250] Enable MX GEMM FLATMM with ASYNC

## Motivation

Enables MX GEMM FLATMM pipeline on gfx1250. The pipeline uses an async
load instruction for tensor A, which complements the existing MX GEMM
FLATMM pipeline with TDM load. At this time, only FLATMM MX pipelines
are enabled on gfx1250.

## Technical Details

The existing gfx950 implementation was extended to support gfx1250
architecture. All three MX FP data types are supported across the two
ASICs.
It should be noted that while the TDM pipeline uses an emulated
32x32x128 warp-tile instruction, the present submission relies on the
built-in 16x16x128 instruction, called 4 times per warp.

## Test Plan

Existing `test/ck_tile/flatmm` tests were extended to cover new gfx1250
functionality.

To help facilitate the testing in development,
`example/ck_tile/18_flatmm/script/smoke_test_mx.sh` script was
introduced to verify various combinations of supported data types and
pipeline versions.

## Test Result

The present submission is expected to work on both gfx950 and gfx1250
hardware for all reasonable sizes and all MX FP8/FP6/FP4 data types.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
- [x] Relies on #6978 and should only be merged after the changes are
merged to the `develop`.
2026-05-29 17:02:45 +00:00
Aviral Goel
4aecc8de5b [rocm-libraries] ROCm/rocm-libraries#7442 (commit b7d57ef)
[CK] CompV4: remove redundant barrier (+5.7% gfx942, +1% gfx950) (#7442)

## Summary

- Remove one redundant `block_sync_lds()` from the pong phase of the
CompV4 GEMM pipeline hot loop
- The pong phase had 2 barriers while ping had 1 — the second pong
barrier (after LDS writes, before global loads) was unnecessary because
the sync at the top of the next ping iteration already ensures LDS
coherence
- Removing this barrier allows global loads to overlap with LDS write
drain, restoring the latency hiding the ping-pong design was built to
provide
- Abstracting away Ping Pong phases into generic lambda avoids making
such mistake again.

## Benchmark

### gfx942 (MI300X), 86 fp16 GEMM shapes

| Metric | Value |
|---|---|
| Improved (>1%) | **80** |
| Neutral (±1%) | **4** |
| Regressed | **2** |
| Average gain | **+5.7%** |
| Best gain | +18.0% (4096x256x16384) |
| Worst regression | -2.9% (12288x3072x4096) |

### gfx950 (MI355X), 86 fp16 GEMM shapes

| Metric | Value |
|---|---|
| Improved (>1%) | **32** |
| Neutral (±1%) | **54** |
| Regressed | **0** |
| Best gain | +9.0% (4096x2048x28672) |

### Top gains by workload

| Shape (MxNxK) | Source | gfx942 BL | gfx942 Opt | gfx942 Gain | gfx950
BL | gfx950 Opt | gfx950 Gain |
|---|---|---|---|---|---|---|---|
| 4096x256x16384 | bloom_fc2 | 38.3 | 45.2 | **+18.0%** | 75.6 | 77.0 |
+1.9% |
| 4096x512x22016 | llama2_7b | 77.8 | 90.8 | **+16.7%** | 152.4 | 154.9
| +1.7% |
| 256x1536x7168 | deepseek | 14.4 | 16.7 | **+16.0%** | 27.2 | 28.0 |
+2.8% |
| 4096x1024x22016 | llama2_7b | 156.2 | 180.8 | **+15.7%** | 304.8 |
311.6 | +2.2% |
| 4096x1024x16384 | bloom_fc2 | 154.6 | 178.5 | **+15.4%** | 303.1 |
309.5 | +2.1% |
| 4096x4096x22016 | llama2_7b | 371.0 | 412.3 | **+11.1%** | 819.8 |
823.6 | +0.5% |
| 4096x2048x28672 | llama3_8b | 235.5 | 259.5 | **+10.2%** | 530.0 |
577.7 | **+9.0%** |
| 250880x256x4096 | bloom_logits | 289.0 | 335.9 | **+16.2%** | 595.5 |
599.1 | +0.6% |
| 8192x8192x8192 | square | 411.8 | 432.9 | **+5.1%** | 825.1 | 825.8 |
+0.1% |
| 7168x4096x8192 | llama70b | 362.9 | 374.7 | **+3.3%** | 775.8 | 782.5
| +0.9% |

## Hardware counter analysis (rocprof-compute, 8192x8192x8192, gfx942)

| Metric | Baseline | Optimized | Delta |
|---|---|---|---|
| s_barrier per ping+pong | 5 | 4 | **-1** |
| MFMA Utilization | 47.8% | 55.5% | **+7.7pp** |
| IPC | 0.17 | 0.21 | **+23.5%** |
| MFMA F16 % of peak | 30.6% | 33.5% | **+2.8pp** |
| VALU (instructions) | 41.67M | 41.67M | identical |
| MFMA (instructions) | 65.91M | 65.91M | identical |
| Spill/Stack Read | 8.27M | 8.27M | identical |

All instruction counts are identical — the optimization removed one
synchronization point, not any compute instructions.

## Correctness

- gfx942: GPU verification (`-v=2`) passed on 4 shapes (8192x8192x8192,
4096x4096x4096, 22016x4096x4096, 4096x512x28672)
- gfx950: GPU verification (`-v=2`) passed on all 86 shapes
2026-05-27 12:23:43 -04:00
Illia Silin
c24e528481 [rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76)
[CK] suppress compiler warnings while building pytorch. (#7760)

## Motivation

Recently added compiler flags that are required to suppress false
warnings by latest staging compiler are not recognized by older compiler
versions and are triggering an avalanche of warnings. Previous attempt
to suppress them by using -Wno-unknown-warning-option flag didn't help,
because that flag wasn't recognized either and just added more warnings.
I've verified that current approach by checking the clang version
actually works as intended and makes the warnings go away.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-27 06:56:58 -07:00
assistant-librarian[bot]
6181eb2adf [rocm-libraries] ROCm/rocm-libraries#4279 (commit 5b3f4b7)
[CK_TILE] Stream-K XCD remapping (#4279)

## Proposed changes

This PR adds support for XCD remapping as detailed in this
[document](https://amdcloud.sharepoint.com/:w:/r/sites/ComposableKernels/Shared%20Documents/Stream-K/Design%20Docs/XCD%20Mapping.docx?d=w2df1b0737dc54614970d99a2e26022d1&csf=1&web=1&e=mLVN4A).
On gfx942, workgroups are typically scheduled round-robin across XCDs,
which can lead to poor locality. We will use a remapping to assign
workgroups to contiguous tiles in the XCDs improving the locality and
the cache hit rate. This is done through a function that computes this
contiguous mapping from this
[PR](https://github.com/ROCm/composable_kernel/pull/3161), which we have
added to the StreamKTilePartitioner. This will require minimal changes
to the Stream-K algorithm, only requiring a remap at the time the
workgroups are partitioned. Through this approach we can improve the
data locality by improving cache hits therefore closing performance gaps
that are seen with the default scheduling. There have been unit tests
added to verify the function in isolation. This is an optimization that
is not specialized to just Stream-K GEMM and can be applied across GEMM.

Note: This only applies to the gfx942 as they introduce the XCDs.

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged

---
🔁 Imported from
[ROCm/composable_kernel#3652](https://github.com/ROCm/composable_kernel/pull/3652)
🧑‍💻 Originally authored by @arai713

---------

Co-authored-by: Astha <astha.rai713@gmail.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: arai713 <67439843+arai713@users.noreply.github.com>
2026-05-26 09:43:03 -07:00
Anton Gorenko
66d6714376 [rocm-libraries] ROCm/rocm-libraries#5388 (commit 45583bd)
[CK_TILE][FMHA] Improve precision of mxfp4 FMHA with fp6 for matrix P (#5388)

## Motivation

Improve precision of mxfp4 without performance penalties.

## Technical Details

Since performance of scale MFMAs is the same when neither A nor B is
fp8/bf8, it is possible to use fp6 x fp4 instead of fp4 x fp4 for the
second GEMM, while types of Q, K, V stay the same.
This allows to improve overall precision significantly because fp6 has
32 non-negative values used for P quantization compared to just 8 values
for fp4.

It was found that there is a compiler bug with
`__builtin_amdgcn_cvt_scalef32_2xpk16_fp6_f32` (described in
LCOMPILER-561) but a workaround seems to fix all failing instances.

## Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

## Test Result

The tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-26 06:55:17 -07:00
JP-Fernando
74bc86240b [rocm-libraries] ROCm/rocm-libraries#5647 (commit 490437a)
[CK Tile] Add gemm universal preshuffle to MX GEMM  (#5647)

## Motivation

Add gemm universal preshuffle support to existing MX GEMM pipeline.

The straightforward way to do this is to port the `mx_flatmm` pipeline
to the existing `gemm_mx` framework.

## Technical Details

The `mx_flatmm` pipeline was not deleted, to allow for
back-compatibility.

## Test Plan

Add `preshuffle` option to example: `tile_example_mx_gemm`.

Add new configurations with enabled preshuffle to the existing
`test/ck_tile/gemm_mx` tests.

## Test Result

Example and tests were successful on `gf950` architecture in the `Alola`
cluster.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
2026-05-22 16:07:53 +02:00
Illia Silin
e02c566795 [rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24)
[CK] upgrade CI to rocm7.13 as default compiler (#7612)

## Motivation

Upgrade the default docker and compiler version in CI to rocm7.13.
In order to pass all the checks I had to also clean up a lot of
non-ascii characters in the source code comments and modify a couple of
tests that were affected by a new compiler logic.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
2026-05-22 02:43:50 +00:00
JiaLuo-CAN
5ff7497fa7 [rocm-libraries] ROCm/rocm-libraries#7537 (commit 07123f4)
[CK Tile] Fix Grouped Gemm quant mixed precision (#7537)

<Migrate from Internal repo PR>
test_ck_tile_grouped_gemm_quant_tensor would fail for mixed FP8/BF8
cases:
std::tuple<Row, Col, Row, FP8, F32, BF8, F32, F32, F16, TensorQuant,
False, True, False>,
std::tuple<Row, Col, Row, BF8, F32, FP8, F32, F32, F16, TensorQuant,
False, True, False>

GFX1250 would fail with incorrect results, GFX950 would fail when
compiling BF8+FP8 and give incorrect results for FP8+BF8.
The issue is due to the wrong ComputeDataType selection.
The fix is to consider original ADataType and BDataType even when
ComputeDataType is not void. For compiling error on gfx950, the bf8,
fp8, 16x16x32 warp Gemm is added.
2026-05-21 08:36:23 -07:00
JP-Fernando
e7798e9560 [rocm-libraries] ROCm/rocm-libraries#7112 (commit a6e5eac)
Add asynchronous XOR shuffle support to the Async GEMM pipeline and the MX GEMM pipeline (#7112)

## Motivation

The goal of this work is to apply XOR shuffle (swizzle) to the current
`comp_async` GEMM pipeline and the `gemm_mx` pipeline.
XOR swizzling has been helpful to avoid LDS bank conflicts, as data are
redistributed across LDS banks, such that simultaneous threads accessing
different rows land on different LDS banks.

## Technical Details

A similar approach to the work in the existing eight-waves pipeline was
followed.
Currently, XOR swizzle support is available for FP8 and BF8 types.
FP4 support is also available for MX GEMM.
Should the types not match, or should the async vector width be of an
unsupported size, then the pipeline falls through to the previously
existing ('unswizzled') path.

## Test Plan

Execute `test_ck_tile_gemm_pipeline_comp_async` for the Async GEMM
pipeline.
Execute `test_ck_tile_mx_gemm_fp8` and `test_ck_tile_mx_gemm_fp4` for
the MX GEMM pipeline.

## Test Result

The tests passed successfully in the `Alola` cluster with MI350
hardware.

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-05-21 09:36:41 +02:00
Aviral Goel
458dd0ac4c [rocm-libraries] ROCm/rocm-libraries#7130 (commit 9e1e065)
[CK_TILE] Redesign LDS store API with pre-computed window coordinates (+15% MI355X, +6% MI300X) (#7130)

## Summary

- Redesign the LDS store API to separate window creation from memory
transfer
- Add `MakeDistributedLdsStoreWindow` factory, `LocalStore` (fast path),
and `LocalStoreWithCoordRecompute` (slow path) to the pipeline base
class
- Convert CompV3 as the reference implementation
- Document the slow/fast path distinction across core tensor headers

## Motivation

`LocalPrefill` hides a performance cliff: when given a bare
`tile_window_with_static_lengths`, it silently reconstructs
`tile_window_with_static_distribution` on every call — paying
significant VALU overhead (~96 for typical configurations) for XOR
coordinate computation. The cost is invisible at the call site.

The new API makes the cost explicit via three verbs:

| Verb | Method | Cost | When to use |
|------|--------|------|-------------|
| **Create** | `MakeDistributedLdsStoreWindow(bare, dstr)` | VALU (once)
| Before hot loop, when VGPR budget allows |
| **Store (fast)** | `LocalStore(precomputed_window, tensor)` | 0 VALU
for coords | Pre-computed window available |
| **Store (on-the-fly)** | `LocalStoreWithCoordRecompute(bare, tensor)`
| VALU per call | VGPR budget tight, or one-shot stores |

Both `LocalStore` and `LocalStoreWithCoordRecompute` enforce correct
window types via `static_assert`. `LocalPrefill` is retained for
backward compatibility (69 call sites across 6 pipeline files).

## Performance

### 86 Shapes, CompV3_2 (128×128 tile), fp16, RCR layout

**gfx942 (MI300X): 86/86 improved, 0 regressions. Average gain: +6.2%**
**gfx950 (MI355X): 85/86 improved, 1 neutral, 0 regressions. Average
gain: ~+15%**

<img width="2777" height="1178" alt="pr7130_perf_chart"
src="https://github.com/user-attachments/assets/b2f5c406-eb20-469d-8da6-dd608c28fbcc"
/>

| Shape (MxNxK) | Source | gfx942 | gfx950 |
|---|---|---|---|
| 22016x256x4096 | llama2_7b_fc1 | +5.3% | +11.4% |
| 22016x512x4096 | llama2_7b_pfill | +5.9% | +10.9% |
| 4096x512x22016 | llama2_7b_pfill | +7.6% | +28.5% |
| 22016x1024x4096 | llama2_7b_pfill | +6.1% | +10.1% |
| 4096x1024x22016 | llama2_7b_pfill | +7.4% | +17.2% |
| 22016x4096x4096 | llama2_7b_pfill | +5.2% | +9.3% |
| 4096x4096x22016 | llama2_7b_pfill | +6.0% | +9.3% |
| 4096x4096x4096 | llama2_7b_pfill | +5.7% | +10.6% |
| 28672x256x4096 | llama3_8b_fc1 | +5.4% | +12.2% |
| 28672x512x4096 | llama3_8b_pfill | +4.9% | +6.4% |
| 4096x512x28672 | llama3_8b_pfill | +7.4% | +1.5% |
| 28672x2048x4096 | llama3_8b_pfill | +4.9% | +8.6% |
| 4096x2048x28672 | llama3_8b_pfill | +6.4% | +8.4% |
| 28672x8192x4096 | llama3_8b_pfill | +5.4% | +8.0% |
| 7168x1024x8192 | llama70b_pfill | +6.6% | +10.8% |
| 8192x1024x7168 | llama70b_pfill | +6.4% | +11.4% |
| 7168x4096x8192 | llama70b_pfill | +6.2% | +9.6% |
| 16384x256x4096 | bloom_fc1 | +6.4% | +20.3% |
| 16384x512x4096 | bloom_fc1 | +5.8% | +8.5% |
| 16384x1024x4096 | bloom_fc1 | +6.0% | +10.9% |
| 16384x2048x4096 | bloom_fc1 | +5.3% | +10.1% |
| 16384x3072x4096 | bloom_fc1 | +5.5% | +8.8% |
| 16384x4096x4096 | bloom_fc1 | +5.7% | +8.8% |
| 4096x256x16384 | bloom_fc2 | +7.8% | +33.6% |
| 4096x512x16384 | bloom_fc2 | +7.5% | +31.6% |
| 4096x1024x16384 | bloom_fc2 | +7.1% | +17.1% |
| 4096x2048x16384 | bloom_fc2 | +6.9% | +11.0% |
| 4096x3072x16384 | bloom_fc2 | +6.8% | +11.0% |
| 4096x4096x16384 | bloom_fc2 | +6.7% | +10.3% |
| 12288x256x4096 | bloom_inproj | +6.7% | +22.0% |
| 12288x512x4096 | bloom_inproj | +6.2% | +9.8% |
| 12288x1024x4096 | bloom_inproj | +5.9% | +12.4% |
| 12288x2048x4096 | bloom_inproj | +5.8% | +10.1% |
| 12288x3072x4096 | bloom_inproj | +5.4% | +10.1% |
| 12288x4096x4096 | bloom_inproj | +5.7% | +9.1% |
| 250880x256x4096 | bloom_logits | +2.6% | +0.5% |
| 4096x256x4096 | bloom_outproj | +7.1% | +28.4% |
| 4096x512x4096 | bloom_outproj | +6.8% | +27.4% |
| 4096x1024x4096 | bloom_outproj | +6.5% | +21.3% |
| 4096x2048x4096 | bloom_outproj | +5.9% | +13.1% |
| 4096x3072x4096 | bloom_outproj | +5.9% | +12.0% |
| 16x1536x7168 | deepseek | +7.7% | +34.7% |
| 32x1536x7168 | deepseek | +7.7% | +34.9% |
| 64x1536x7168 | deepseek | +7.6% | +31.3% |
| 128x1536x7168 | deepseek | +7.6% | +25.8% |
| 256x1536x7168 | deepseek | +7.7% | +27.9% |
| 512x1536x7168 | deepseek | +7.6% | +29.1% |
| 1024x1536x7168 | deepseek | +7.3% | +28.8% |
| 2048x1536x7168 | deepseek | +6.9% | +20.5% |
| 4096x1536x7168 | deepseek | +6.3% | +11.0% |
| 8192x1536x7168 | deepseek | +6.2% | +11.3% |
| 16384x1536x7168 | deepseek | +6.0% | +9.1% |
| 20480x1536x7168 | deepseek | +4.8% | +9.3% |
| 16x3072x1536 | deepseek | +6.3% | +25.1% |
| 32x3072x1536 | deepseek | +6.4% | +25.3% |
| 64x3072x1536 | deepseek | +6.4% | +24.8% |
| 1024x1024x1024 | square | +5.5% | +18.7% |
| 2048x2048x2048 | square | +6.0% | +19.2% |
| 3584x3584x3584 | square | +5.3% | +11.2% |
| 5120x5120x5120 | square | +6.1% | +10.0% |
| 6144x6144x6144 | square | +5.5% | +9.8% |
| 8192x8192x8192 | square | +6.0% | +8.2% |
| 1024x4608x1024 | midsize | +4.6% | +4.6% |
| 512x18432x512 | midsize | +1.9% | +10.1% |
| 4096x18432x4096 | midsize | +5.8% | +8.8% |
| 320x8192x320 | stablediff | +4.0% | +11.3% |
| 640x2048x640 | stablediff | +4.5% | +14.0% |
| 320x8192x1280 | stablediff | +5.6% | +20.1% |
| 1x1280x8192 | skinny_m1 | +7.7% | +35.3% |
| 1x8192x1024 | skinny_m1 | +6.0% | +20.3% |
| 1x7168x8192 | skinny_m1 | +7.7% | +36.6% |
| 1x8192x3584 | skinny_m1 | +7.3% | +27.9% |
| 1x13312x6656 | skinny_m1 | +7.6% | +30.3% |
| 1x13312x16384 | skinny_m1 | +7.8% | +4.2% |
| 1x16384x6656 | skinny_m1 | +7.5% | +28.7% |
| 1x16384x16384 | skinny_m1 | +7.7% | +2.3% |
| 16x4096x4096 | skinny_m16 | +7.4% | +31.9% |
| 16x22016x4096 | skinny_m16 | +7.5% | +26.5% |
| 16x28672x4096 | skinny_m16 | +7.0% | +15.1% |
| 16384x1280x8192 | skinny_m16 | +5.6% | +8.7% |
| 16384x8192x1024 | skinny_m16 | +4.5% | +8.8% |
| 2048x4096x2048 | mixed | +4.7% | +9.0% |
| 4096x2048x8192 | mixed | +6.8% | +11.0% |
| 8192x4096x4096 | mixed | +5.2% | +10.0% |
| 1x4096x4096 | mixed | +7.4% | +32.4% |
| 1024x1024x4096 | mixed | +7.1% | +27.4% |

### ISA Hot Loop Diff (LBB1_32, per K-iteration, gfx942)

| Metric | Baseline | Optimized | Delta |
|--------|----------|-----------|-------|
| Total VALU | 621 | 500 | **-121** |
| VGPR / SGPR | 512 / 96 | 512 / 96 | unchanged |

### Hardware Counters — Instruction Mix (gfx950, rocprofiler-compute)

Profiled on MI350X, shape 4096×256×16384 (bloom_fc2). Instruction counts
are deterministic hardware counters.

| Metric | Baseline | Optimized | Δ |
|--------|----------|-----------|---|
| **VALU instructions/kernel** | 4,642,473 | 987,958 | **−78.7%** |
| **INT32 VALU** | 2,592,786 | 541,129 | **−79.1%** |
| Instructions / wavefront | 39,178 | 24,400 | −37.7% |
| VGPRs (avg) | 98 | 90 | −8% |
| **MFMA instructions** | 2,059,702 | 2,059,702 | **0%** |
| **LDS instructions** | 1,564,891 | 1,564,891 | **0%** |
| **VMEM instructions** | 520,996 | 520,996 | **0%** |

MFMA as fraction of total instructions: **30.7% → 67.5%**. Eliminating
~3.65M redundant INT32 VALU instructions (XOR coordinate recomputation
per K-iteration) leaves the scheduler more headroom for MFMA dispatch,
directly explaining the benchmark gains.
2026-05-19 20:22:37 -04:00
Enrico Degregori
9565ca21ec [rocm-libraries] ROCm/rocm-libraries#5552 (commit 369c7a2)
[CK Tile] Eight Waves pipeline for MX GEMM (#5552)

## Motivation

Integrate Eight Waves pipeline in MX GEMM

## Technical Details

 - EightWaves pipeline:
- Add pipeline, policy and block gemm (internally using existing
implementation used by GEMM and ABQuant)
   - Extend support of EightWaves policy for FP4 (packed types)
 - Async pipeline:
- Fix pipeline with packed scales (requires MRepeat and NRepeat to be
contiguous)
- block gemm specific for MX GEMM is defined because distribution
encodings have changed
 - CShuffle:
- Add new functionality to support MRepeat and NRepeat contiguous
(defined by `TilesPacked`)
 - Examples:
- Refactor examples to easily switch different configurations (similar
to GEMM universal)
- Scales values generated consistently with other microscale
implementations in CK Tile
   - Add configuration for EightWaves pipeline
 - Tests:
   - Unify existing FP8 and FP4 tests
   - Add tests for EightWaves pipeline
- Scales values generated consistently with other microscale
implementations in CK Tile

Note: FP6 support for MX GEMM was added later and the support for the
Eight Waves pipeline will be done in following PR

## Test Plan

Add new pipeline to tests: `test_ck_tile_mx_gemm_async` for both FP4 and
FP8

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-19 11:53:19 -07:00
Illia Silin
717f2efef7 [rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)
[CK] add composable kernel support on gfx1250 (#6978)

## Motivation

Add composable kernel support on gfx1250.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Qun Lin <qlin@amd.com>
Co-authored-by: jialuo12_amdeng <jia.luo@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
2026-05-15 06:46:51 -07:00
Illia Silin
ac18460782 [rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70)
[CK] Suppress new staging compiler errors (#7384)

## Motivation

This should make new builds with staging compiler pass.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-14 12:51:08 -07:00
ltqin
501e7ef12a [rocm-libraries] ROCm/rocm-libraries#6574 (commit b3db057)
[CK_TILE] Add SageAttention v2 forward kernel with multi-granularity quantization (#6574)

## Summary

Add a CK_TILE forward kernel implementing [SageAttention
v2](https://arxiv.org/abs/2411.10958) — an attention algorithm that
applies multi-granularity quantization to Q/K/V before computing
attention, trading minimal accuracy loss for higher throughput on
low-precision hardware.

### Quantization design

| Tensor | Supported data types | Scale granularity options |
|--------|---------------------|--------------------------|
| Q | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp
(32 tokens), per-thread (4 tokens) |
| K | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp
(64 tokens), per-thread (16 tokens) |
| V | fp8 | per-channel (always) |
| O | bf16 | — |

Three precision combinations are supported: `fp8/bf16` (QKV fp8, O
bf16), `i8/fp8/bf16` (QK int8, V fp8, O bf16), and `i4/fp8/bf16` (QK
int4, V fp8, O bf16).

### Architecture support

- **gfx9** (CDNA2/3, e.g. gfx90a, gfx942) — full tile set
- **gfx950** (CDNA4) — restricted tile set (N-per-block capped at 64 for
fp8-family dtypes)

### Implementation

- Two pipeline variants: `QRKSVS` (synchronous) and `QRKSVS_ASYNC`
(async copy)
- Masking support: no mask, causal (top-left / bottom-right), and
generic windowed
- Batch and group (variable-length) modes
- Head dimension: d=128, d_v=128
- Python codegen under `example/ck_tile/49_sageattention/codegen/`
generates kernel instances per target/dtype/tile combination
- Smoke tests included via `tile_example_sageattn_fwd`

### Test commands

\`\`\`bash
# fp8 QKV
./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128
-kname=1 -prec=fp8bf16 -qscale=3 -init=3

# int8 QK, fp8 V
./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128
-kname=1 -prec=i8fp8bf16 -qscale=3 -init=3
\`\`\`

\`-qscale\` values: 1=per-tensor, 2=per-block, 3=per-warp, 4=per-thread
2026-04-30 11:32:23 -07:00
Qianfeng
0e6a514e4f [rocm-libraries] ROCm/rocm-libraries#6209 (commit 89c9f3e)
Improve the performance of qr_ks_vs_whole_k_prefetch pipeline (#6209)

## About qr_ks_vs_whole_k_prefetch pipeline
This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to
improve performance on both MI350 GPUs through better MFMA instruction
usage, transposed V-loading support, and N0-loop implementation. The
pipeline targets scenarios where the number of workgroups is low,
enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs
128) while prefetching entire K tiles.

## Changes:

- Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload)
to avoid using shuffle instructions on MI350
- Implements N0-loop based Gemm0 to reduce tile window movement overhead
and eliminate `clear_tile` calls
- Adds full support for hdim96/hdim160 without padding requirements
- Updates MFMA instruction selection to ensure optimal choices for MI350

## Performance results

1. For attention shapes which leads to kM0=64,
`qr_ks_vs_async_whole_k_prefetch_trload` shows much better performance
than `qr_ks_vs_async_trload` on the same case (execution time `41.02ms`
by whole_k_prefetch_trload & `58.50ms` by async_load), and
`qr_ks_vs_async_whole_k_prefetch_trload` also shows obviously better
performance than the recently tuned `qr_ks_vs_async` on the same case
(execution time `41.02ms` by whole_k_prefetch_trload 7 `47.60ms` by
qr_ks_vs_async)
2. Also on MI300, for attention shapes which leads to kM0=64,
`qr_ks_vs_async_whole_k_prefetch` shows much better performance than the
`qr_ks_vs_async` (which is supposed to be very high-efficient) on the
same case (execution time `64.50ms` by whole_k_prefetch & `80.20ms` by
qr_ks_vs_async)
3. For attention shapes which leads to kM0=128,
`qr_ks_vs_async_whole_k_prefetch_trload` show a little bit better
performance than `qr_ks_vs_async` on mi350 (execution time `104.50ms` by
whole_k_prefetch_trload & `106.50ms` by qr_ks_vs_async). And they shows
completely on-par performance on MI300

## Test/Verify

1. Use the ROCM xformers branch `test_whole_k_prefetch_n0loop` to
test/verify qr_ks_vs_whole_k_prefetch pipeline since this pipeline can
not be used by ck_tile fmha example so far
2.  Use the following command-line for building/testing xformers
>```bash
> #> git clone -b test_whole_k_prefetch_n0loop
https://github.com/ROCm/xformers
> #> git submodule update --init --recursive
> #> pip  install --no-build-isolation -e ./
> #> pytest tests/test_mem_eff_attention.py::test_forward
>```
4. Any scripts which can run on xformers can be used to evaluate
qr_ks_vs_whole_k_prefetch pipeline. Using the two environ variable to
switch from using different pipelines
> ```bash
> #> export FMHA_DISABLE_SPECIAL_TREATMENT=1 #> to disable using FAV3
and qr_ks_vs_async_trload pipeline
> #> export FMHA_ENABLE_ASYNC_PIPELINE=1 #> to disable using
qr_ks_vs_async pipeline for comparing
> ```

## Discussion

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: poyenc <1132573+poyenc@users.noreply.github.com>
Co-authored-by: qianfengz <12429178+qianfengz@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-04-24 10:30:41 -06:00
Illia Silin
d16061f578 [rocm-libraries] ROCm/rocm-libraries#6550 (commit c396de9)
[CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550)

## Motivation

New changes from upstream llvm-project cause an avalanche of warnings in
CK. Gonna disable them by ignoring the
lifetime-safety-intra-tu-suggestions flag until a better permanent
solution is found.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-22 15:47:47 +00:00
Sami Remes
de3fa71992 [rocm-libraries] ROCm/rocm-libraries#6611 (commit 5375c0f)
[CK_TILE] Preserve input strides in EightWaves async-load descriptor (#6611)

`MakeAsyncLoadADramWindow` in
`GemmPipelineAgBgCrCompAsyncEightWavesPolicy` was rebuilding the 6D view
descriptor with `make_naive_tensor_descriptor_packed`, which synthesizes
strides from lengths and assumes a dense layout. When the input view's
leading-dim stride is larger than its inner length (non-packed memory
layout), the resulting tile window stepped through memory at the wrong
stride.

Compose the unmerge transforms on top of the input view's existing
descriptor instead, so the actual runtime strides are preserved and the
correct `element_space_size` is inherited for bounds checking.

## Test Plan

Added an unit test showing the problem.

## Test Result

The new test fails before fixes and passes after.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-22 12:52:02 +02:00
Hosang Yoon
7e4e291771 [rocm-libraries] ROCm/rocm-libraries#6450 (commit b75fed1)
[CK_TILE] Skip padded k/n fragment work in qr_hpad FMHA fwd (#6450)

## Motivation

`qr_hpad` currently executes work for padded head-dim fragments even
when only a subset of the values are valid. This adds unnecessary
computation for head dimensions that require padding, such as `hdim=72`
and `hdim=80`, and hurts FMHA forward performance.

The goal of this PR is to make the padded-head-dim path skip invalid
work based on the actual valid fragment count, while preserving the
existing behavior for the non-padded path.

## Technical Details

This PR improves the `qr_hpad` FMHA forward path in three parts:

- Skip padded `k`/`n` fragments in the GEMM/pipeline path when only part
of the fragment is valid.
- Add partial GEMM0 tail handling for `qr_hpad` so the kernel uses the
valid fragment range instead of always computing over the padded extent.
- Retune the gfx11 `qr_hpad` kernel configuration after enabling the
partial-fragment path.

To keep the existing path stable, the implementation adds overloads for
the updated GEMM/pipeline interfaces. This allows existing full-tile
callers to keep using the previous form, while the `qr_hpad` path can
pass valid fragment counts when needed.

## Test Plan

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

## Test Result

- On gfx11 and gfx12, for head dimensions that require padding,
`tile_example_fmha_fwd` shows about 20-30% performance improvement at
`hdim=72/80`.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-18 06:44:46 +00:00
chris-tsiaousis-hpc
1c04029c0e [rocm-libraries] ROCm/rocm-libraries#5508 (commit 0ad0aca)
[CK Tile] Unification work - mma transformations pipeline (#5508)

## Motivation

In this PR we showcase how the amdgcn structs could be used in a pipeline that does some extra pre/post processing.
For the sparse intrinsics, so far we compressed the A vector "on the fly" right before the execution of the builtin. This might introduce performance issues down the line if, for example, the user decided to chain multiple sparse builtins. We tackle this problem by creating a specific SparseCompressTransform.

A MmaPipelineBase is also created to facilitate those kind of higher level compositions of the amdgcn structs and is integrated to the existing WaveWiseMma prototype. There is an effort to facilitate future operations, like swizzle A/B, C transpose or double/quad attr num access through the MmaPipelineOptionFlags, but those are not yet defined and should do so in a future PR.
The pipeline base class is basically at the RFC stage.

We also create a runtime test for the existing WaveWiseMma, as well as one for the SparseMma pipeline.

## Technical Details

The goal should be to have the pipeline easily expandable. May the CRTP of the base class or the interface in general be insufficient or unable to handle all of our needs, then a design modification should be discussed.

## Test Plan

New tests are added.

## Test Result

Tests should pass.

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-04-14 09:25:01 +02:00
Aviral Goel
c7eb33078c [rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8)
CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302)

Depends on #6300

## Summary

Remove 41 commented-out code blocks across 33 files in Composable
Kernel, totaling ~200 lines.

Identified using an automated dead code scanning skill (`ck-dead-code`)
with a calibrated two-stage pipeline:
1. **Pre-filter**: Keyword-based scan found 1,338 `//`-commented blocks.
Calibrated heuristics (trained on 50-sample expert classification)
reduced to 89 high-confidence candidates — 93% noise reduction.
2. **Expert triage**: LLM expert classified each block in context as
CODE_REMOVE, CODE_KEEP, or NOT_CODE.

| Classification | Count |
|---------------|-------|
| Removed (this PR) | 41 |
| Kept (debug helpers, alt configs, reference impls) | 32 |
| Not code (false positives) | 16 |

Removed blocks include: superseded implementations, old test data,
abandoned stubs, unreachable code, and buggy dead code.
2026-04-10 11:17:11 -04:00
Christopher Millette
c6e5cd65d3 [rocm-libraries] ROCm/rocm-libraries#5939 (commit 6fb1791)
[CK_TILE] Flatten nested static_for loops into static_ford (#5939)

## Summary
Mechanical conversion of 129 nested `static_for`/`static_ford` patterns
to flat `static_ford` across 29 ck_tile header files.

Each conversion eliminates intermediate lambda closure instantiations by
replacing nested compile-time loops with a single flat iteration using
index decomposition.

### What `static_ford` eliminates

When `static_for` loops are nested, each level creates unique closure
types:
```cpp
// BEFORE: M + M×N = 20 IR functions (for M=4, N=4)
static_for<0, 4, 1>{}([&](auto m) {        // 4 closure instantiations
    static_for<0, 4, 1>{}([&](auto n) {     // 4×4 = 16 closure instantiations
        body(m, n);
    });
});

// AFTER: M×N = 16 IR functions (with ford_applier, no intermediates)
static_ford<sequence<4, 4>>{}([&](auto mn) {
    constexpr auto m = number<mn[number<0>{}]>{};
    constexpr auto n = number<mn[number<1>{}]>{};
    body(m, n);
});
```

### Pattern categories converted

| Category | Count | Description |
|----------|-------|-------------|
| C (2-level `static_for` chains) | 112 | Nested `static_for` →
`static_ford` |
| C3 (3-level `static_for` chains) | 9 | Three consecutive nests →
`static_ford` |
| Partial rescue | 3 | Outer 2 levels of blocked 4-level nests |
| B (nested `static_ford` merge) | 5 | Two nested `static_ford` → single
higher-dim `static_ford` |
| **Total** | **129** | Across 29 files |

6 false positives were detected and reverted (in `tensor_adaptor.hpp`,
`tile_distribution.hpp`, `tile_distribution_encoding.hpp`) where the
inner loop bound depended on the outer variable.

### Files changed by family

| Family | Files | Sites |
|--------|-------|-------|
| Block GEMM | 12 | ~20 |
| FlatMM pipelines | 4 | ~69 (including 5 ford-ford merges) |
| GEMM quant | 7 | ~22 |
| FlatMM kernel | 1 | 2 |
| FMHA | 1 | 2 |
| Reduce/norm | 2 | 2 |
| Epilogue | 1 | 1 |

### Blocked locations from review comments

- **block_gemm_areg_breg_creg_v1.hpp:356** — BLOCKED: runtime scale
loads (`scale_a_slice`, `scale_b_slice`, A warp tensor load) between
every nesting level
- **block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp:228** — BLOCKED:
`zero_accumulators()` before inner loop; `sched_barrier` + conditional
`block_sync_lds()` after inner loop
- **block_universal_gemm_as_aquant_bs_bquant_cr.hpp:298** — BLOCKED:
runtime `CWarpTensor` construction before inner loop; quantization scale
application code after inner loop
- **block_universal_gemm_as_aquant_bs_cr.hpp:277** — BLOCKED: same
pattern as above
- **block_universal_gemm_as_bs_bquant_cr.hpp:367** — BLOCKED: same
pattern as above

## Depends on
- #5938 ([CK_TILE] Optimize static_ford and sequence compile-time
infrastructure) — provides the `ford_applier` that makes these
conversions beneficial. Without it, `static_ford` uses a recursive
implementation that provides no IR function savings.

## Results (combined with #5938)

### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942)

| Target | Base (s) | Treat (s) | Delta | % | Significant? |
|--------|----------|-----------|-------|---|-------------|
| **flatmm** | 161.1 | 149.0 | **-12.1s** | **-7.5%** | **YES** (p<0.01,
7/7 wins) |
| **universal_gemm** | 225.4 | 220.3 | **-5.1s** | **-2.3%** | **YES**
(p<0.01, 7/7 wins) |

### IR Function Counts (device trace, gfx942)

| Target | InstFunc | CodeGen |
|--------|----------|---------|
| universal_gemm | **-8.5%** | **-9.2%** |
| flatmm | **-7.6%** | **-10.5%** |

### ASM Equivalence
5/5 PASS — 650,151 lines verified identical (gfx942). TUs:
universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale.

## Test plan
- [x] ASM equivalence verified (650K lines, gfx942)
- [x] Wilcoxon timing verified (7 trials, p<0.01)
- [x] IR function counts verified (-7.6% to -10.5% CodeGen reduction)
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2026-04-07 08:36:45 -06:00
Po Yen Chen
bf736dfa74 [rocm-libraries] ROCm/rocm-libraries#6051 (commit f0838b2)
[CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051)

## Motivation

The existing FMHA V3 pipeline only supports fp16/bf16 data types. This
PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950,
enabling higher throughput for
  FP8 inference workloads using the assembly-optimized V3 code path.

  ## Technical Details

  **Warp GEMM:**
- Add FP8 32x32x32 warp gemm with C-transposed distribution
(`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries

  **V3 Kernel (`fmha_fwd_v3_kernel.hpp`):**
- Add per-tensor descale support for Q, K, V tensors, passing descale
pointers through to pipeline kargs

  **V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`):**
  - Add FP8 data path with dtype-aware type selection
  - Add asm volatile P matrix conversion from f32 to fp8
  - Add FP8-aware instruction scheduling in `CoreLoopScheduler`

**V3 Pipeline Policy
(`block_fmha_fwd_v3_pipeline_default_policy.hpp`):**
- Add FP8 QK warp gemm selection (SwizzleB variant for V tile
distribution compatibility)

  **Codegen (`fmha_fwd.py`):**
  - Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128)
- Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale:
no/pertensor)
  - Extend `can_dispatch_v3` condition for fp8bf16 + pertensor

  **Misc:**
- Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum
(`arch.hpp`)
- Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`)

V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`)
pending further validation.

## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950)

  | Problem | Regular (TFlops) | Varlen (TFlops) |
  |---|---:|---:|
  | batch=1 heads=6/1 seqlen=1024 causal | 48.9 | 47.6 |
  | batch=1 heads=6/1 seqlen=2048 causal | 119.8 | 117.4 |
  | batch=1 heads=6/1 seqlen=4096 causal | 263.7 | 259.2 |
  | batch=1 heads=6/1 seqlen=8192 causal | 548.9 | 543.6 |
  | batch=1 heads=6/1 seqlen=16384 causal | 1043.0 | 1063.7 |
  | batch=1 heads=6/1 seqlen=32768 causal | 1237.2 | 1279.6 |
  | batch=1 heads=6/1 seqlen=65536 causal | 1315.4 | 1382.7 |
  | batch=1 heads=6/1 seqlen=131072 causal | 1326.3 | 1402.2 |
  | batch=1 heads=16/1 seqlen=65536 causal | 1298.7 | 1388.4 |
  | batch=1 heads=40/40 seqlen=37200 non-causal | 1248.9 | 1326.1 |

## Test Plan

Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering
batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40),
GQA ratios (1:1, 1:8), and
causal/non-causal modes. Verified all cases dispatch to the V3 pipeline
by enabling `F_is_v3_enabled` and confirming kernel names contain
`qr_async_trload_v3`.

  ## Test Result

176/176 tests passed with V3 enabled. All cases correctly dispatched to
V3 pipeline with `pertensor` quantization.

  ## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-07 22:19:28 +08:00
Chinmay Dattanand Kuchinad
9a49642035 [rocm-libraries] ROCm/rocm-libraries#5776 (commit ee1bbcb)
[CK] Fix async pivot mismatch in persistent GEMM kernel scheduler (#5776)

## Motivation

Fix pivot mismatch in the persistent GEMM kernel's async input scheduler
that causes **GPU hangs** and incorrect results when used with AsyncTP
(Asynchronous Tensor Parallelism) on ROCm.

PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM
kernel with chunk signals to overlap communication and computation. The
pivot mechanism ensures each rank starts computing from its own local
shard first (which is already available), then moves to remote chunks as
they arrive over the network.

Because of the pivot mismatch, the kernel frequently waits on signals
for chunks that have not yet arrived, while attempting to read data from
completely different chunks. This synchronization desync reliably
triggers infinite hangs during multi-GPU native AsyncTP execution. This
fix is required to enable functional AsyncTP support on ROCm.

## Technical Details

In the persistent kernel loop (`UniversalGemmKernel::operator()`), the
M-tile coordinate used for data selection (`i_m`) and the M-tile
coordinate used for the chunk-signal wait (`chunk_idx`) were derived
from inconsistent bases:

* `i_m` was computed from the **unpivoted** tile index `iM`.
* `chunk_idx` was computed from the **pivoted** expression `(iM +
tile_idx_pivot)`.

This means the kernel could wait for chunk N's signal but then read from
chunk M's memory, or vice versa. The mismatch scales with GPU count:
with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc.

**The Fix:**
Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m`
and `chunk_idx` from it. This guarantees the kernel always waits for the
correct chunk before reading its data.

*(Note: Minor cosmetic `clang-format` changes were also pulled in
alongside the fix).*

## Test Plan

1. Build PyTorch with this CK change.
2. Run the specific multi-GPU AsyncTP native test:
`timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest
test/distributed/test_symmetric_memory.py -k
test_fused_all_gather_matmul_native -q -s -x`

## Test Result

Tests verify correct overlapping execution without hangs or accuracy
mismatches when running the AsyncTP native path with non-zero pivots.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-04-01 09:21:20 -07:00
Bartłomiej Kocot
1f30eb4a54 [rocm-libraries] ROCm/rocm-libraries#5842 (commit 04c5690)
[CK][CK Tile] Force padding for atomic_add bf16 C tensor (#5842)

## Motivation

Force padding for atomic_add bf16 C tensor to avoid memfaults.

## Technical Details

- add global atomic add for bf16 and enable them
- add padding for atomic add bf16 due to the lack of oob
- remove padding for not continous dims in conv for other cases
- minor bwd data conv fixes

## Test Plan

test_grouped_conv_*_tile

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-31 10:02:24 +02:00
Linjun-AMD
7469320248 [rocm-libraries] ROCm/rocm-libraries#5849 (commit d9b89b2)
[CK_TILE ]Revert "[CK_TILE] Enable MXFP6 for MX GEMM op (#5095)" (#5849)

This reverts commit 7e55766ddf7e9e20791b0e4e2d7b4026cf16b637.

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 20:36:39 +00:00
Johannes Graner
d219484b46 [rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649)
[CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393)

## Motivation

Add StreamK work distribution to the CK Tile grouped convolution
backward weight kernel. Split-K divides the K-dimension uniformly across
a fixed `k_batch`, which causes load imbalance when the number of output
tiles doesn't evenly fill the GPU. StreamK distributes total
K-iterations evenly across workgroups, improving utilization on these
shapes.

## Technical Details

StreamK is added as an `if constexpr` branch in the existing kernel,
selected by the `TilePartitioner_` template parameter. Two reduction
strategies are supported:
- **Linear**: tile-starter sequentially accumulates partials from
contributing CTAs
- **Tree**: pairwise binary tree reduction (O(log n) depth, faster for
many contributors)

Both persistent and non-persistent data-parallel (DP) sections are
supported.

Key changes:
- `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution
path with `RunStreamK`/`RunStreamKLoop`, partial store/load via
workspace, flag-based cross-CTA synchronization,
`GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions
- `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers)
and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by
both GEMM and Conv StreamK kernels
- `streamk_gemm_kernel.hpp`: Refactored to use shared helpers
- Merged split-K and StreamK example invokers via `PartitionerPolicy`
template parameter
- StreamK example binary with `--streamk_reduction=linear|tree` and
`--streamk_persistent=0|1`
- CK Builder integration: `SpecifiesStreamK` concept,
`TilePartitionerType` factory helper, `InstanceTraits` with StreamK
fields
- 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP),
negative, builder regression

### Performance (MI355X, gfx950)

Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}):

| Shape | 16x64 tiles | | 128x128 tiles | |
|---|---|---|---|---|
| | Split-K | StreamK | Split-K | StreamK |
| 1x1 128x128 N=32 28x28 | 1.00x | 0.54x | 1.00x | 0.81x |
| 3x3 128x128 N=32 14x14 | 1.00x | 0.59x | 1.00x | 0.62x |
| 1x1 256x64 N=32 56x56 | 1.00x | 0.83x | 1.00x | 1.83x |
| 3x3 512x512 N=2 7x7 | 1.00x | 1.12x | 1.00x | 0.62x |
| 1x1 1024x1024 N=4 7x7 | 1.00x | 1.09x | 1.00x | 0.60x |
| 3x3 128x128 N=32 28x28 | 1.00x | 0.44x | 1.00x | 0.96x |
| 3x3 256x256 N=32 14x14 | 1.00x | 0.67x | 1.00x | 0.93x |
| 3x3 512x512 N=32 7x7 | 1.00x | 0.98x | 1.00x | 1.16x |

StreamK's value depends on tile config: with larger tiles (fewer output
tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up
to 1.16x on typical large-channel convolutions. Tree reduction
consistently outperforms Linear when multiple CTAs contribute to the
same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n)
sequential accumulation. The table reports the best of Linear and Tree
for each shape.

## Test Plan

```bash
ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk
./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk

# Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON)
ninja -C build check-builder
```

30 tests covering:
- Host-side: type traits, kernel args construction, grid size, workspace
size
- GPU end-to-end (Linear + Tree): small/medium shapes, multi-group,
stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher
occupancy
- Persistent DP: Linear + Tree with persistent data-parallel dispatch
- Negative: `IsSupportedArgument` rejects unaligned K and C
- Builder: Create (instance string validation) + Execution (reference
comparison) + instance string regression

## Test Result

All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK
tests pass. Full `check-builder` suite passes. Tolerances computed
dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-27 10:17:10 +01:00
joyeamd
cb76c8b8e2 [rocm-libraries] ROCm/rocm-libraries#5789 (commit 6654ca6)
[CK][CK_TILE] Revert addional oob check in gemm IsSupported function (#5789)

## Motivation

fix ck_tile's oob check.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-26 09:40:44 +08:00
joyeamd
9abcf2bfa2 [rocm-libraries] ROCm/rocm-libraries#5697 (commit dd1c396)
Revert "Ck/joye/revert oob check (#5640)" (#5697)

This reverts commit 552ab4880292694cb8261f40fa4223af52cb8419.

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-23 22:04:14 +00:00