3438 Commits

Author SHA1 Message Date
Márton Bidlek
0d18f4fc05 [rocm-libraries] ROCm/rocm-libraries#4798 (commit 0acaf5f)
Using named functors instead of lambdas

## Motivation

Currently, in block-level GEMM pipelines, there is significant code
repetition for prefetching and tail handling, where lambda functions
create a unique instantiations at each call. This includes repeated
static_for instantiations and large loops such as MRepeat. Each
repetition results in additional instantiations, which increases
compilation time and binary bloat.

## Technical Details

Refactor repeated code blocks into named functors so the compiler can
reuse already instantiated code instead of generating multiple copies.

Scope of changes:

1. WMMAOPS pipeline internals:
projects/composablekernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_wmmaops_base.hpp,
projects/composablekernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_wmmaops_v1.hpp,
projects/composablekernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_wmmaops_v3.hpp
2. XDLOPS and preshuffle pipeline variants across
projects/composablekernel/include/ck/tensor_operation/gpu/block
(v1/v2/v3/v4/v5, scale, dequant, gufusion, moe, mx, blockscale,
skip-b-lds, dpp, xdlops)

Shared functor file:
projects/composablekernel/include/ck/utility/vector_load_functor.hpp

## Test Plan

Note that the provided compilation traces by -ftime-trace do not report
unnamed lambda instantiations, so a clear baseline for instantiation
counts cannot be established. As a result, the impact of this change
will be evaluated based on runtime performance rather than direct
instantiation-count comparisons.

## Test Result

The effects of this were timed by the compilation of a single HIP object
through an example (grouped_gemm_wmma_splitk_fp16.cpp). The average user
time and speedup of this using the average of 100 compilations is:
- Mean compile time before the changes: 37.734 s
- Mean compile time after:  32.087 s
- Speedup: 17.6%

Ran a full CK compilation on Alola with the following results:

| Metric | Before (min) | After (min) | Absolute Reduction (min) | %
Reduction |
| ------ | ------------ | ----------- | ------------------------ |
2026-06-08 17:11:53 +00:00
Emily Martins
674f7cdc0e [rocm-libraries] ROCm/rocm-libraries#8141 (commit d3defa6)
[CK] Remove Stream-K from old CK

## Motivation

Since Stream-K has a CK Tile implementation, we no longer need Stream-K
in old CK. Hence, this PR removes Stream-K from old CK.

## Technical Details

All Stream-K artifacts in old CK have been removed including examples,
tests, kernels, and CK profiler artifacts.

## Test Plan

Ran a CI run on the branch before publishing PR.

## Test Result

All tests passed.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
2026-06-08 16:47:26 +00:00
Johannes Graner
0b3c297ee2 [rocm-libraries] ROCm/rocm-libraries#8009 (commit 26ab70d)
[CK Tile] Add WAVELET pipeline for forward grouped
 convolution (#8009)

## Motivation

CK Tile forward grouped convolution trails classic CK on 3x3
convolutions whose
output-channel count is not divisible by 8, where the narrow output
store limits
the compute CShuffle epilogue. This ports the WAVELET pipeline (added
for
backward-weight in #7937) to the forward kernel to close that gap.

## Technical Details

- Kernel (`grouped_convolution_forward_kernel.hpp`): WAVELET
load/math-wave wiring,
mirroring the backward-weight implementation; the non-WAVELET path is
unchanged.
- Generator: implement `parse_native_fwd_instance`, the forward
native-instance parser.
- Registered WAVELET instances: profiler bf16 3 / fp16 5, tests 1 each.

WAVELET requires input channels divisible by 8 (it does not apply to
depthwise).
The bf16/fp16 instance asymmetry is intentional and measured: the VecC=8
tiles
never beat the compute pool in bf16 but win about 20% of divisible-by-8
3x3 shapes
in fp16, so VecC=8 is registered for fp16 only.

## Test Plan

- Correctness (CPU reference) for every registered profiler instance,
across VecC variants.
- Per-shape best-instance performance sweep over the 34 RetinaNet shapes
(bf16) and
a 200-shape cross-model sweep (bf16 and fp16), compared against classic
CK.

## Test Result

- Correctness: PASS for all instances.
- RetinaNet (bf16, vs classic CK): faster on 28 of 34 shapes, geomean
+19.5%; the
not-divisible-by-8 shapes up to 3.7x. One 1x1 stride-2 shape stays ~20%
behind
  classic CK, unrelated to WAVELET.
- Cross-model (200 shapes): WAVELET wins 3x3 not-divisible-by-8 in both
dtypes
(up to 61% over the next-best compute instance); for divisible-by-8 3x3
it wins
  about 20% of shapes in fp16 (3-11%) and none in bf16.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 08:57:39 +00:00
Johannes Graner
b7d59e4b5f [rocm-libraries] ROCm/rocm-libraries#8099 (commit fc4894b)
[CK Tile] Fix Stream-K flag store: wave-uniform SGPR address
 for scalar s_store/s_load (#8099)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Stream-K grouped-conv (and GEMM) kernels fail to assemble for some
instances: the inline scalar flag store/load gets a VGPR address
operand, which scalar-memory instructions reject (`invalid operand for
instruction`). This blocks Stream-K instances from building.

## Technical Details

- `StreamKReductionOps::{Signal,Wait}StorePartialDone` (shared by GEMM
and conv, added in #5393) take `kargs` by `const&` and feed
`kargs.workspace_ptr` / `cta_idx` into inline
`s_store_dword`/`s_load_dword` with `"s"` constraints. For some
instantiations the compiler can't keep the pointer wave-uniform and
emits a VGPR address.
- Fix: route the pointer and offset through `amd_wave_read_first_lane`
so the scalar-memory address is a wave-uniform SGPR before the asm. Same
instructions, no algorithm change.
- Not arch-specific: the affected instance fails on
gfx908/gfx90a/gfx942/gfx950 without the fix; whether the compiler spills
to a VGPR depends on the instantiation (tile/warp/pipeline), not the
target.

## Test Plan

- Compile the previously-failing dispatcher instance for
gfx908/gfx90a/gfx942/gfx950.
- `test_ck_tile_grouped_conv_bwd_weight_streamk` on gfx942, gfx90a,
gfx950 hardware.
- gfx950 perf A/B (example, bf16/tree, 10 runs each) with vs without the
change.

## Test Result

- Failing instance now assembles on all four archs; previously failed on
every one.
- 30/30 conv Stream-K tests pass on gfx942, gfx90a, gfx950.
- gfx950 perf delta -0.13% (within run-to-run noise) — no regression
from the added readfirstlane on the cold flag path.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 08:57:04 +00:00
Bartłomiej Kocot
28f2966762 [rocm-libraries] ROCm/rocm-libraries#7734 (commit 03ffb9d)
[CK] Grouped Convolution Global Load/Store instances

## Motivation

Support global load and store in grouped convolutions using instance
factory.

## Technical Details

- add new instances for each direction
- add new tests for large cases

## Test Plan

New test for large cases

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1255
2026-06-06 22:52:59 +00:00
Bartłomiej Kocot
2c363870d9 [rocm-libraries] ROCm/rocm-libraries#6744 (commit 9d056e8)
[Ck][CK Tile] Global Load/Store for Large Tensors support
 (#6744)

## Motivation

Create solution to support large tensors in the entire ck tile.

## Technical Details

- add possiblity to use global load
- int64 indexing

## Test Plan

conv fwd tests

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-913
2026-06-06 10:14:17 +00:00
Enrico Degregori
1b4fbd95fd [rocm-libraries] ROCm/rocm-libraries#6089 (commit c876d18)
[CK Tile] Extend type support EightWave pipeline

## Motivation

EightWave pipeline was designed for 8 bit types. This PR extend support
for any FP type

## Technical Details

 - Generalize policy to support any FP type
- Change LDS layout to fix bank conflicts. This removes all bank
conflicts in the pipeline (checked for all supported types). Remaining
bank conflicts are related to Cshuffle epilogue.

## Test Plan

Added GEMM tests with new supported types. Note that FP6 is also
supported for MX GEMM but the PR was reverted so no tests were added for
it.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-05 23:54:40 +00:00
Thrupti Raj Lakshmana Gowda
054436ca4a [rocm-libraries] ROCm/rocm-libraries#8079 (commit cf1e8f2)
[tile_engine] Integrate gemm_streamk into budget-based
 sampling system (#8079)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

`gemm_streamk` was the only GEMM op not participating in the tile
engine's budget-based sampling system. Without a budget cap, it would
always generate its full feasible set, making build times unpredictable
and inconsistent with the other ops.

## Technical Details

- **CMake budget propagation** (`ops/gemm/CMakeLists.txt`): Added
`gemm_streamk` to the active-ops detection loop so it receives a share
of the sampling budget. Because `gemm_streamk` lives in a sibling
subdirectory (`ops/gemm_streamk/`), its allocation is written via `CACHE
STRING "" FORCE` to make the variable visible across the CMake directory
boundary.

- **Per-combo budget division** (`ops/gemm_streamk/CMakeLists.txt`,
`ops/gemm/grouped_gemm/CMakeLists.txt`): Added the same per-combo
`MAX_INSTANCES` division that exists in `gemm_universal` and
`gemm_preshuffle`. The total budget is divided by `n_datatypes ×
n_layouts` before the inner `foreach` loop so that sampling fires
independently per `(dtype, layout)` combo rather than acting as a single
global cap.

- **Sampling integration** (`gemm_streamk_instance_builder.py`): Added
`_apply_sampling()` method to `GemmKernelBuilder`, mirroring the
Sobol+LHS+maximin sampling used by other ops. New constructor
parameters: `gpu_target`, `max_instances`, `seed`, `tier`,
`manifest_path`. New CLI arguments: `--gpu_target`, `--max-instances`,
`--seed`, `--tier`, `--manifest-path`. The `--gpu_target` argument is
now also forwarded on the `--list_kernels` invocation.

- **`GEMM_STREAMK_AXES`** (`sampling/feasible_set.py`): Defined as
`GEMM_AXES + ["reduction_strategy"]` to account for the extra axis
unique to stream-K. Added `reduction_strategy` to `CATEGORICAL_AXES`.

- **Weight rebalancing** (`sampling/op_weights.json`): Allocated 10%
weight to `gemm_streamk` by proportionally reducing `gemm_universal`
(0.35 → 0.30) and `gemm_preshuffle` (0.30 → 0.25). Total remains 1.00.

## Test Plan

- Configure with `TILE_ENGINE_SAMPLING_TIER=daily` and verify that
`gemm_streamk` receives a non-zero budget allocation and that
`GEMM_STREAMK_MAX_INSTANCES` is set correctly.
- Configure with `TILE_ENGINE_SAMPLING_TIER=daily` across multiple
`(dtype, layout)` combos and confirm per-combo budget = total /
n_combos.
- Configure with `-DGEMM_STREAMK_MAX_INSTANCES=50` explicit override and
verify the override is respected (budget allocation skipped).
- Verify `chosen_instances.json` manifest is written to the working path
when tier is active.
- Confirm `op_weights.json` weights still sum to 1.00.

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-05 17:06:11 +00:00
Brock Hargreaves
b2a3ffea5d [rocm-libraries] ROCm/rocm-libraries#5945 (commit 8f9a5fe)
[CK] [MIOPEN] Split convolution library by layout
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

# Split Composable Kernel convolution operations by data layout

TLDR:

1. This is a reorganization of files, folders, and CMakeLists for
convolution kernels and facilitates a splitting of the convolution
library into layouts.
2. The speedup can range anywhere between 15-40% depending on the target
architecture for miopen only builds of CK. For TheRock nightly builds of
CK, which includes both miopen and hip tensor kernel instances, this
constituted in a 10% decrease in compile time for gfx1100.
## Overview

Based on https://github.com/ROCm/composable_kernel/pull/3010/ (except
keeping 1 static library)

## What MIOpen Actually Uses

MIOpen **exclusively uses:
- **NHWGC** for all 2D convolutions
- **NDHWGC** for all 3D convolutions
This is because MIOpen's tensor descriptors natively use channel-last,
group-aware formats.
## Key Changes

### 1. Layout-Based Directory Structure
Reorganized convolution instance files from flat per-operation to
hierarchical layout-based structure. For example:

**Before:**
grouped_conv2d_fwd/
├── device_grouped_conv2d_fwd_xdl_nhwgc_*.cpp (MIOpen-required)
├── device_grouped_conv2d_fwd_xdl_gnhwc_*.cpp (optional)
└── device_grouped_conv2d_fwd_xdl_ngchw_*.cpp (optional)
**After:**
grouped_conv2d_fwd/
├── nhwgc/ ← MIOpen-required
│   ├── xdl/device_grouped_conv2d_fwd_xdl_*.cpp
│ └── wmma/device_grouped_conv2d_fwd_wmma_*.cpp
├── gnhwc/ ← Optional (excluded with MIOPEN_REQ_LIBS_ONLY)
└── ngchw/ ← Optional (excluded with MIOPEN_REQ_LIBS_ONLY)
### 2. Preserved Umbrella Library
As before, all convolution operations are consolidated into a single
static `device_conv_operations` library:
- Aggregates layout-specific instance object files via
`ADD_CONV_LAYOUT_INSTANCES` macro
- **Default build:** Includes all layouts (NHWGC + GNHWC + NGCHW +
NDHWGC + GNDHWC + NGCDHW)
- **MIOpen build (`MIOPEN_REQ_LIBS_ONLY=ON`):** Includes only NHWGC and
NDHWGC layouts
### 3. Binary Size Reduction
When building with `MIOPEN_REQ_LIBS_ONLY=ON`:
**Layouts Included (26 targets):**
- 7× NHWGC instances (2D operations + variants)
- 19× NDHWGC instances (3D operations + variants)

**Layouts Excluded (16 targets):**
- 3× GNHWC instances (2D operations)
- 3× NGCHW instances (2D operations)
- 3× GNDHWC instances (3D operations)
- 3× NGCDHW instances (3D operations)
- 2× GNWC instances (1D operations)
- 1× NWGC instance (1D operations)
- 1× additional NHWGC instance (grouped_conv1d_fwd, not needed by
MIOpen)
This represents a **~38% reduction in instance targets** (16 excluded
out of 42 total
layout-specific targets).

### Testing
-  All existing CK tests link against the umbrella library
-  MIOpen links successfully with the reduced umbrella library
-  Profiler builds with all layout-specific targets explicitly listed

Notes from the Author:

Since this refactor moved most of the convolution files further into
subdirectories, I concentrated on ensuring that no source files were
excluded, including sharded sources: Targets are correctly migrated — no
missing targets, no shard count mismatches.
2026-06-05 15:09:20 +00:00
Yung-sheng Tu
e826b2eb7e [rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f)
=?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?=
 =?UTF-8?q?Add=20MFMA=20specialisations=20for=20`tf32=5Ft`=20(#6768)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

This PR adds two specialisations related to `tf32_t`.

## Technical Details

This change treats `tf32_t` as a concrete type rather than an empty
`struct`. It also adds two new specialisations for MFMA dense builtins
and resolves existing circular include issues.

## Test Plan

All the new wrappers were added to the test suite in
test_amdgcn_mma_layout.inc.

## Test Result

Test should pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-05 12:27:41 +00:00
Sami Remes
ad4e2e7624 [rocm-libraries] ROCm/rocm-libraries#7199 (commit 23f7320)
[CK_TILE] [QuantGEMM] Fix SplitK tail handling and other
 improvements (#7199)

This pull request introduces improved and more robust split-K support
for quantized GEMM. The main changes add runtime validation, utility
functions for split-K batch calculations, pointer offset handling for
split-K in grouped kernels, and enhanced support for various tensor
layouts. The changes also improve error handling and provide more
flexibility for runtime tail handling in split-K pipelines.

**Split-K Support and Validation Enhancements:**

* Added runtime validation to ensure `k_batch` is a positive integer and
that split-K configurations do not produce empty final batches or
mismatched pipeline tails, with detailed error messages and logging for
misconfiguration.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1184-R1211)
[[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1161-R1250)
* Introduced utility functions `get_splitk_batch_k_read` and
`get_splitk_last_batch_k` to compute per-batch K read sizes and handle
split rounding, ensuring correct and consistent split-K batch
partitioning.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R206-R234)
[[2]](diffhunk://#diff-635b89bdffa96b2b42f1632520cde36701d7d631e864185591f6b32f7645cf47L104-R107)
[[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L388-R417)
[[4]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1161-R1250)
* Changed the default value of `k_batch` in `QuantGemmHostArgs` to 1 (no
split-K) for safer default behavior.

**Pointer Offsets and Grouped Kernel Handling:**

* Updated `QuantGroupedGemmKernel` to apply split-K per-batch offsets to
all input pointers, mirroring the behavior of non-grouped kernels and
ensuring correctness for split-K launches.
* Modified AQ tensor view handling to correctly reflect the remaining
K-groups from the split-K batch's offset position, improving accuracy
for split-K in grouped kernels.

**Pipeline and Layout Flexibility:**

* Added support for runtime selection of split-K tail handling via a new
template parameter `RuntimeSplitKTail_`, with new helper methods to
dispatch GEMM pipelines accordingly.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R273)
[[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1496-R1567)
[[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1427)
[[4]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1447-R1629)
[[5]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1459-R1641)
* Improved handling for tensor layout cases, including preshuffled B and
both row-major and column-major AQ layouts, ensuring correct pointer
arithmetic and compatibility checks.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R438-R454)
[[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L464-R516)
[[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1184-R1211)
2026-06-05 11:41:49 +00:00
Enrico Degregori
7b9245f18c [rocm-libraries] ROCm/rocm-libraries#5854 (commit 8e2d46d)
[CK Tile] Async support preshuffle GEMM

## Motivation

Add async support to existing preshuffle GEMM pipeline

## Technical Details

Notes:
the implementation avoids previous strategy of duplicating pipelines for
async support and instead add a switch `Async` to the ops Problem to
enable async pipeline. Then, integrate the async pipeline in the
existing one. This allows to avoid code duplication and facilitate the
integration of buffer load to lds in existing pipelines. In my opinion,
it should be used also for other pipelines which don't support buffer
load to lds yet and it would also be a good idea to refactor the
existing async GEMM pipelines with the same approach.

Summary:

 - integrate buffer load to lds in existing pipeline
- add optimal tensor descriptors for vmem loading and lds reading. They
are currently optimized for 16x16 wave tiles but they also work for
32x32 wave tiles. Optimizations for 32x32 wave tile requires different
lds layout and it will be done in a follow-up issue
 - Add async config to examples
 - Add test (gfx950 only)

## Test Plan

New test for gfx950 `test_ck_tile_gemm_pipeline_wp_async`

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-05 07:17:09 +00:00
Aviral Goel
267ca67001 [rocm-libraries] ROCm/rocm-libraries#8028 (commit c1cb112)
[CK_Tile] Add wmma_bf16f32_16x16x32_bf16 via
 fused-downconvert override (#8028)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Adds `__builtin_amdgcn_wmma_bf16f32_16x16x32_bf16` (fp32 accumulate →
bf16 output) to the CK Tile WMMA warp-gemm path. **API only** — the unit
test is split into a stacked PR (#8035) so this API change can be
reviewed in isolation.

## Changes (4 files)

- **16-bit trait:** `wmma_intrinsic_downconvert` (calls the bf16f32
builtin — fp32 C in, bf16 C out) plus `COutDataType = bf16_t` /
`COutVecType`.
- **`WarpGemmAttributeWmmaImpl` / `WarpGemmAttributeWmma`:**
`mac_downconvert(c_fp32, a, b)` (kTransC-aware) returning the bf16
C-output vector.
- **`WarpGemmImpl`:** `mac_downconvert` tail handler producing a bf16
C-output tile from the fp32 accumulator tile, reusing
`CWarpDstrEncoding` (output layout identical to the f32 C tile).

Verified on gfx1250 (via the stacked test PR #8035): the test passes;
the existing WMMA warp-gemm test is unaffected (additive change only).
2026-06-05 05:01:31 +00:00
Enrico Degregori
bdd7a8333d [rocm-libraries] ROCm/rocm-libraries#6672 (commit bda3f97)
[CK Tile] PermuteN support MX GEMM

## Motivation

Add PermuteN support to preshuffle MX GEMM

## Technical Details

 - Modify `shuffle_b_permuteN` to support MX preshuffled layout
- Add `preShuffleScalePermuteN` with same functionality of
`preShuffleScale` but layout consistent with PermuteN
 - Include MX pre-processing functions in the library

## Test Plan

Add test configuration for permuteN with preshuffle (both FP4 and FP8)

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Cong Ma <congma13@amd.com>
2026-06-05 03:04:43 +00:00
spolifroni-amd
449f8b4c5b [rocm-libraries] ROCm/rocm-libraries#7955 (commit c87a40f)
[ck] Updated CK Tile documentation to use mermaid diagrams
 (#7955)

## Motivation

There were mermaid diagrams in the CK Tile doc that were converted to
svg. However, there is an extension for mermaid diagrams. The conf.py
and requirements.in have been updated to use that extension instead of
the svg files.
2026-06-04 22:59:52 +00:00
Brock Hargreaves
4e1296674d [rocm-libraries] ROCm/rocm-libraries#7990 (commit b8b5b43)
[CK] Load ck.groovy via Jenkins Shared Library
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

This allows the CI service to have a configuration source-of-truth
outside the PR under test, allowing rapid system changes. Bug fixes on
the develop branch propagate immediately to all pipelines that don't
override the parameter -- no rebase required.

A new `USE_CURRENT_BRANCH_FOR_CK_GROOVY` parameter lets contributors
test pipeline changes on their own branch without any extra
configuration.

## Technical Details

- `loadCk()` in the Jenkinsfile is updated to call
`library("ck@${branch}").ck.get()` instead of `checkout scm` + `load
"vars/ck.groovy"`. The `checkout scm` inside `loadCk()` is removed since
Jenkins now handles the library fetch internally.
- A `USE_CURRENT_BRANCH_FOR_CK_GROOVY` boolean parameter (default: off)
is added. When off, `ck.groovy` is always loaded from `develop` — all
normal PR builds are unaffected. When on, `ck.groovy` is loaded from the
current branch automatically via `env.CHANGE_BRANCH`, so contributors
testing pipeline changes just tick the box.
- `return this` is removed from the end of `ck.groovy`. This was
required by the `load` convention but is not needed (and can cause
errors) in a shared library context.
- `loadCk()` is kept at every call site rather than called once at the
top, preserving restart-from-stage safety — if a build is restarted from
a mid-pipeline stage, `ck` is still initialized correctly.
- The Jenkins Shared Library named `"ck"` must be registered in Jenkins
Global Pipeline Libraries

## Test Plan

1. Trigger "Build with Parameters" on the PR branch with
`USE_CURRENT_BRANCH_FOR_CK_GROOVY=true`
2. Verify "Determine CI Execution" stage completes and the library()
calls indicates the current branch
3. Verify "Static checks" stage completes.
4. Trigger a second build with `USE_CURRENT_BRANCH_FOR_CK_GROOVY=false`
(default) to confirm normal builds still load from `develop`.

## Test Result

Verified both paths. The develop library is loaded by default, the
branch library is loaded when the parameter is enabled.

## Submission Checklist

- [ X ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-04 22:32:37 +00:00
Illia Silin
aef7b42883 [rocm-libraries] ROCm/rocm-libraries#7816 (commit f6324af)
[CK] Fix latest build issues with staging compiler.

## Motivation

Fixing new warnings with staging compiler.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-04 17:41:09 +00:00
John Afaganis
96c39b331e [rocm-libraries] ROCm/rocm-libraries#7829 (commit 13af7da)
[ck] Enforce ASCII-only C/C++ sources for hipRTC
 compatibility (#7829)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

CK source files must be compilable via **hipRTC (HIP runtime
compilation)**, whose preprocessor does not accept non-ASCII bytes
anywhere in a translation unit — **including in comments**. Bytes that
are harmless under `hipcc` (em-dashes, smart quotes, multiplication
signs, Greek letters, box-drawing glyphs, etc.) cause hipRTC to fail at
preprocessing time. These regularly leak in via LLM-assisted authoring
or copy/paste from formatted documents and silently break hipRTC paths
that are not exercised by the default `hipcc`-based build matrix.

This PR (a) cleans every existing violation (53 files) and (b) adds a
pre-checkin gate so new violations are rejected before merge.

## File extensions covered

Both the cleanup scan and the new Jenkins enforcement stage use the same
predicate:

```
*.h  *.hpp  *.cpp  *.h.in  *.hpp.in  *.cpp.in  *.inc  *.cl
```

(excluding `*/build/*` and `*/include/rapidjson/*`). This is a strict
superset of the existing `Clang Format` stage's predicate — `*.inc` is
added so test-fixture include files are also gated. The local pre-commit
hook's `c++/inc` type filter covers the same set.

## Why no enforcement today

CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so
the existing `pre-commit` workflow doesn't touch CK. The local CK
`.pre-commit-config.yaml` only runs for developers who installed hooks.
The **authoritative gate is therefore the new Jenkins stage** in this
PR; the local hook is convenience.

## Commit layout (bisect-friendly)

1. `79798aa6261` — **`[ck] Convert reflect/ rendering to ASCII for
hipRTC compatibility`**
Behavior change, isolated. `TreeFormatter` swaps `├─ / └─ / │ ` for `|-
/ +- / | ` (3-col width preserved so alignment is unchanged).
`conv_description.hpp` swaps `×` for `x` as the dimension separator.
`test_conv_description.cpp` expected strings updated in lockstep so the
snapshot test stays green. This is the only commit in the series with
observable runtime impact.

2. `738fdb0d81c` — **`[ck] Strip non-ASCII bytes from C++ sources for
hipRTC compatibility`**
Mechanical text cleanup across 53 files. Replacements happen in comments
or in `std::cout` strings that are not asserted on by any test. None of
the 174 `.inc` files in the tree required edits, but they were in the
scan's predicate so the enforcement stage's predicate is a superset of
what was scanned. Full replacement table in the commit message.

3. `1d7cd8ba235` — **`[ck] Enforce ASCII-only C/C++ sources for hipRTC
compatibility`**
- New `projects/composablekernel/script/check_ascii_only.sh` (modeled on
`check_copyright_year.sh`).
- New entry in `projects/composablekernel/.pre-commit-config.yaml` under
the local-hooks block (`types_or: [c++, inc]`).
- New `ASCII Only Check` parallel stage in
`projects/composablekernel/Jenkinsfile`'s `Static checks` block,
mirroring the existing `Clang Format` stage but with `*.inc` added to
the find predicate. Always-on, no `RUN_CPPCHECK` gate.

The tree is buildable at every commit boundary. Commit 1 leaves 50 known
violations; commit 2 leaves 0; commit 3 wires the gate.

## Demo

Script output on a synthesized violation:

```
$ printf '// em-dash test \xe2\x80\x94 here\n' > /tmp/bad.cpp
$ projects/composablekernel/script/check_ascii_only.sh /tmp/bad.cpp
ERROR: /tmp/bad.cpp contains non-ASCII bytes:
1:// em-dash test — here
  Fix: replace with ASCII (em-dash -> --, smart quotes -> ", arrows -> ->, etc.)
$ echo $?
1
```

Full repo scan after the cleanup commits (note the `-name '*.inc'`
clause):

```
$ cd projects/composablekernel && find . -type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \
    -o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \
    -not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \
  | xargs -0 -P 8 -n 64 script/check_ascii_only.sh
$ echo $?
0
```

## Test plan

- [ ] Jenkins PR build: confirm new `Static checks -> ASCII Only Check`
stage runs green over the full predicate (incl. `*.inc`) and existing
`Clang Format` stage is unaffected.
- [ ] `test_conv_description` passes against the ASCII tree-formatter
output (touched in commit 1).
- [ ] Local: `pre-commit run ascii-only-checker --all-files` runs
cleanly after installing CK pre-commit hooks via
`script/install_precommit.sh`.
- [ ] Manually inject a non-ASCII byte in any `.cpp/.hpp/.inc` file,
push: confirm Jenkins fails the new stage with a clear error.
- [ ] Spot-check a representative subset of touched files under hipRTC
compilation to confirm no remaining hipRTC-blocking content (optional,
since the static byte check is a sufficient condition for hipRTC
preprocessor acceptance on this dimension).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-06-04 15:00:17 +00:00
Copilot
4fcd73a98e [rocm-libraries] ROCm/rocm-libraries#7974 (commit 9df2c76)
composablekernel: remove stray *.hpp.bk backup artifacts
 (#7974)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Four `*.hpp.bk` files were accidentally committed to
`projects/composablekernel/`, likely as leftovers from a prior merge or
conflict resolution. Each is an older snapshot of its `.hpp` counterpart
— the canonical `.hpp` files are newer and contain the correct current
content.

## Deleted files

| File | vs. `.hpp` counterpart |
|---|---|
| `ck_tile/core/tensor/tile_window.hpp.bk` | Older version: uses legacy
`bool isL1Cache`/`PrefetchL1` template params; missing
`DataCachePrefetchKind`-based prefetch API and `data_cache_prefetch.hpp`
include |
| `ck_tile/core/tensor/load_tile_transpose.hpp.bk` | Older version:
missing `#if defined(__gfx950__)` guard and `Quad` struct (~90 lines)
for gfx1250 architecture |
| `ck_tile/ops/gemm/warp/warp_gemm_dispatcher.hpp.bk` | Older version:
missing `WmmaTag`, `IsScale16` template param, and several newer
dispatcher specializations |
|
`ck_tile/ops/gemm_quant/block/block_universal_gemm_as_bs_bquant_cr.hpp.bk`
| Older version: `KPackA`/`KPackB` (since renamed `KPack`); uses
`static_ford` (since refactored to nested `static_for`) |

## Verification

- No other `.bk` files exist in `projects/composablekernel/`.
- No build scripts, CMake files, includes, or documentation reference
these `.bk` files.
- No `.hpp` files were modified.
2026-06-04 03:06:43 +00:00
apophis
42c82b093e [rocm-libraries] ROCm/rocm-libraries#7786 (commit 7842dfd)
[CK TILE][Windows] add `msvc::no_unique_address` support for
 Windows (#7786)

## Motivation

While building Flash Attention 2 with CK backend, this warning will spam
in every kernel:
```
DEBUG [1/1837] hipcc.exe ...
DEBUG In file included from H:\ROCm\flash-attention\build\fmha_fwd_d32_bf16_batch_b64x64x16x32x32x32_r4x1x1_r4x1x1_w16x16x16_w16x16x16_qr_vr_pssk_nlogits_alibi_mask_lse_ndropout_nskip_nqscale_ntrload_nsink_gfx12.cu:6:
DEBUG In file included from H:\ROCm\flash-attention\csrc\composable_kernel\example\ck_tile\01_fmha\fmha_fwd.hpp:6:
DEBUG In file included from H:\ROCm\flash-attention\csrc\composable_kernel\include\ck_tile/core.hpp:111:
DEBUG H:\ROCm\flash-attention\csrc\composable_kernel\include\ck_tile/core/tensor/tile_scatter_gather.hpp:1246:7: warning: unknown attribute 'no_unique_address' ignored [-Wunknown-attributes]
DEBUG  1246 |     [[no_unique_address]] std::conditional_t<kUseGlobalLoad_, PageIdxArray, gl_field_empty_t>
DEBUG       |       ^~~~~~~~~~~~~~~~~
DEBUG H:\ROCm\flash-attention\csrc\composable_kernel\include\ck_tile/core/tensor/tile_scatter_gather.hpp:1254:7: warning: unknown attribute 'no_unique_address' ignored [-Wunknown-attributes]
DEBUG  1254 |     [[no_unique_address]] std::conditional_t<kUseGlobalLoad_, index_t, gl_field_empty_t>
DEBUG       |       ^~~~~~~~~~~~~~~~~
DEBUG 2 warnings generated when compiling for host.
...
```

## Technical Details

`[[no_unique_address]]` is not working on Windows LLVM, should use
`[[msvc::no_unique_address]]`.

## Test Plan

Build FA2 with CK backend.

## Test Result

No warnings, no errors.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-06-04 02:28:12 +00:00
Aviral Goel
e01603bc31 [rocm-libraries] ROCm/rocm-libraries#7725 (commit eef7e12)
[GFX1250][CK_TILE] Add scale16 warp gemm unit tests

## Summary
- Add scale16 WMMA intrinsic overloads and int64_t forwarding to warp
gemm layers for gfx1250
- Add comprehensive wave-level unit tests for scale16 warp gemm
(16x16x128 and 32x32x128 tile sizes)
- Test all fp8/bf8 type combinations and TransposeC variants
- Fix WarpGemm wrapper for non-uniform scale16 configurations

Stacked on #7724 (FillUniformScaleDistribution / MX GEMM scale init).
Pipeline enablement follows in the next PR.
2026-06-03 22:05:29 +00:00
Bartłomiej Kocot
45a8f96c66 [rocm-libraries] ROCm/rocm-libraries#7943 (commit 944adfd)
[CK] Grouped conv profiler updates

## Motivation

Reduce profiling time for no verification.

## Technical Details

Remove not needed code for no verification

## Test Plan

test_grouped_convnd*

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1230
2026-06-03 15:00:36 +00:00
chris-tsiaousis-hpc
db05d61136 [rocm-libraries] ROCm/rocm-libraries#6212 (commit ccee58d)
=?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?=
 =?UTF-8?q?More=20accurate=20tests=20for=20MmaPipelines=20(#6212)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

This PR solves several issues:

#### More accurate tests for MmaPipelines

The current tests for the MmaPipelines (test_amdgcn_sparse_mma,
test_amdgcn_wavewise_mma) use explicit input fragment vectors filled
with 1s, and only check the output of a single lane. We should have
tests that actually use the MmaPipelines with non-trivial input matrices
and verify the complete output.
Some other aspects of the current MmaPipelines tests that I noticed and
deserve some attention:

1. There is sometimes iteration over K outside of the pipeline, which is
then included in WaveTileK or FragK, which is not correct. We should
remove it, move K iteration inside of the pipeline, or be more clear
about this outer-K loop size and how it propagates downwards.
2. There is very tight coupling between the kernel, gtest code, and
test_pipeline helper, requiring a lot of information and functions to be
passed back and forth.
3. The test_pipeline helper is doing a bunch of register-related logic
on the host (related to point 1)
4. Without this register logic the only thing it does is check the
device, call the kernel, and check the output, but with a lot of
boilerplate.

#### Test helper for detecting target arch at HOST runtime

There is a really apparent issue we faced while writing tests:

Scenario:
1. Compile a test that supports both gfx950 and gfx1201 for gfx950
2. Run the test on a server that only has gfx1201 GPU

Actual:
Segmentation fault

Expected:
The test can correctly detect from HOST runtime that the DEVICE
target_id was different and skips the test.

Notes:

The only way of detecting the COMPILER_TARGET_ID in the existing "arch"
framework is launching a kernel and calling `get_compiler_target()` (so,
from a DEVICE code). This will create a segmentation fault if the
current arch differs from the target arch. To cope with this issue, we
propose to export the compiler target(s) (note they can be many) through
`projects/composablekernel/test/ck_tile/core/arch/CMakeLists.txt` and
define a test helper to deal with such cases.

#### Add composition support to Transforms

We have a small number of Transforms which act on MmaOp input and output
data, before and after the MmaOp call respectively. These are currently
implemented to work on an MmaTile level, but in theory they are also
supposed to work at a WaveTile level, i.e. after composition of multiple
MmaTiles to create larger effective MNK dimensions. Currently the
composed MmaTiles look like 2D C-style arrays of the individual MmaTile
level register vectors (see WaveWiseMmaPipeline). The transforms should
be able to take these and perform the proper transforms to the whole
WaveTile at once. This might allow for better performing
transformations.

Note: This PR handles the SparseTransform case and if we don't end up
doing scale as a transformation, there isn't really much left to do. If
we end up having only the sparse transform as a non-trivial transform,
then we could also consider removing the Transform framework.
2026-06-03 14:35:18 +00:00
Ville Pietilä
88f8d24c34 [rocm-libraries] ROCm/rocm-libraries#7936 (commit 3dc91e6)
[CK Tile] Fix V6 pipeline applicability and split-image
 initialization (#7936)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

After adding code generation via CK Tile Dispatcher, some fwd and bwd
weight tests for CK Tile convolutions are failing. This PR introduced
correct applicability checks and fixes the split-image parameter
initialization such that non-applicable instances are not invoked during
test execution and split-image instances are correctly initialized.

## Technical Details

Investigation revealed two distinct problems

1. For bwd weight, the compute V3 uses prefetch of 3 distinct tiles,
which works incorrectly when the number of K-slices addressed by the
workgroup is 1. This occurs when a large split-K value is used for a
problem that results in a small Gemm-K value.
2. For fwd direction, the current CK Profiler/test infrastructure
doesn't initialize the split-image parameters for instance where
split-image is enable. Uninitialized split-image values result in
non-deterministic behavior where the tests might randomly fail.

Fixed problem 1. by adding a check in `IsSupportedArgument` that marks
the instance invalid if the `num_loops = ceil(GemmK / (k_batch *
KPerBlock)) < 4` for V6 pipeline kernel instances. The check is
compile-time eliminated for other kernels.

Fixed problem 2. by adding initialization of split-image parameters when
split-image is enabled. The default initialization corresponds to full
image with no split, i.e., the number of splits is 1 and it has the size
of the full image.

Added unit tests for the added logic.

## Test Plan

Running the following test suites cover the logic added in this PR
- test_grouped_convnd_fwd_tile
- test_ck_tile_grouped_conv_fwd
- test_grouped_convnd_bwd_weight_tile
- test_ck_tile_grouped_conv_bwd_weight

All test suites above are included in the automated test runs.

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-03 08:40:03 +00:00
Anton Gorenko
7ecbf82708 [rocm-libraries] ROCm/rocm-libraries#7500 (commit f5cd4fd)
[CK_TILE][FMHA] Optimize long-context decoding on gfx11/12
 (#7500)

## Motivation

Relevant issue: ROCM-22065

FMHA has less-than-optimal performance of long-context decoding (i.e.
when seqlen_q = 1) on gfx11/12.
This PR optimizes the splitkv pipeline and configs for such scenarios.

## Technical Details

Optimizations applied in this PR:
1. use tiles with smaller M0 (16 vs 64), these tiles are used when
seqlen_q <= 16
2. adapt qr_nwarp_sshuffle pipeline for gfx11, it allows to use more
warps even for M0 = 16 (the qr pipeline parallelizes work between warps
in M dim so with M0 = 16 it allows to use only 1 warp)
3. enable kMergeNumHeadGroupsSeqLenQ (an optimization that merges one
group of heads in GQA) for all hdim values, not only 128
4. increase the number of splits (multiply by the number of head groups)
if (3) is used
5. increase the number of splits for RDNAs (`multiProcessorCount` is the
number of WGPs on RDNAs, not CUs, so it should be doubled to have
meaning similar to CDNAs)

Performance on gfx1151:

| Case | develop (GB/s) | This PR (GB/s) |
|:-------|-------:|-------:|
| [fp16\|group\|bshd] b:1, h:32/32, s:1/45056, d:64/64 | 127.58 | 183.11
|
| [fp16\|group\|bhsd] b:1, h:32/32, s:1/45056, d:64/64 | 153.64 | 215.02
|
| [fp16\|group\|bshd] b:1, h:16/8, s:1/77184, d:128/128 | 120.51 |
225.76 |
| [fp16\|group\|bhsd] b:1, h:16/8, s:1/77184, d:128/128 | 130.62 |
223.84 |
| [fp16\|group\|bshd] b:1, h:32/32, s:1/9600, d:128/128 | 82.65 | 138.44
|
| [fp16\|group\|bhsd] b:1, h:32/32, s:1/9600, d:128/128 | 105.75 |
220.45 |
| [fp16\|group\|bshd] b:1, h:8/1, s:1/401024, d:256/256 | 16.27 | 187.89
|
| [fp16\|group\|bhsd] b:1, h:8/1, s:1/401024, d:256/256 | 16.28 | 188.19
|

## Test Plan

An additional test case is added to the exiting test. It uses seqlen_q =
1, GQA, no mask to trigger the changes
```
ninja test_ck_tile_fmha_fwd_fp16 && bin/test_ck_tile_fmha_fwd_fp16 --gtest_filter="*SplitKV*
ninja test_ck_tile_fmha_fwd_bf16 && bin/test_ck_tile_fmha_fwd_bf16 --gtest_filter="*SplitKV*
```

Manual testing can be done with these commands:
```
bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=64  -s=1 -s_k=$((352 * 128))  -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1
bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=16 -h_k=8  -d=128 -s=1 -s_k=$((603 * 128))  -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1
bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=128 -s=1 -s_k=$((75 * 128))   -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1
bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=8  -h_k=1  -d=256 -s=1 -s_k=$((3133 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1
```

## Test Result

All the tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-03 06:16:10 +00:00
Yi DING
01bd52bdb5 [rocm-libraries] ROCm/rocm-libraries#7925 (commit a8f0845)
[CK] Fix gfx950 AITER Sync Regressions
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Fixes three gfx950 regressions in the AITER downstream CI that surfaced
after the internal/gfx1250 re-sync (ROCm/rocm-libraries#6978):

> **Companion aiter PR:** ROCm/aiter#3392 — host-side adaptations
(`Kernel::BlockSize()` `constexpr` drops, blockscale `KBatch=1` clamp)
plus the CK submodule bump used to validate these fixes together.

- **FlyDSL MoE AOT cache miss** — the AITER MoE tests run with
`check_aot_cache=True` and fail on any FlyDSL JIT cache miss, but the CI
never pre-compiles the FlyDSL MoE kernels, so gfx950 always misses.
Pre-compile them at the start of the AITER test stage.
- **`buffer.load.lds.v4i32` link error** — ROCm/rocm-libraries#6978
reintroduced a clang-version guard mapping
`llvm.amdgcn.raw.buffer.load.lds` to a `.v4i32`-suffixed name. That name
exists in no LLVM (the rsrc operand is a fixed, non-overloaded `<4 x
i32>`, so the intrinsic is never type-mangled), so gfx950 4-DWORD
direct-to-LDS (e.g. fp4 MoE bpreshuffle) fails to link with `lld:
undefined symbol: llvm.amdgcn.raw.buffer.load.lds.v4i32`. Use the
canonical plain name unconditionally.
- **mixed-precision flatmm warp-GEMM call** — ROCm/rocm-libraries#6978
generalized the scaled `WarpGemmImpl::operator()` from a fixed `<index_t
opselA, index_t opselB>` signature to a variadic `<typename... Params>`
one and updated the `mx_flatmm` pipeline to pass the op-selectors as
`OpSelA<>`/`OpSelB<>` types, but missed the mixed-precision flatmm
pipeline (`F8xMXF4`/`F16xMXF4`), which still passed raw integer
op-selectors. These no longer bind to `typename... Params` (`error: no
matching member function for call to 'operator()'`), breaking
compilation of the fp8/bf16 × fp4 cktile MoE gemm1 instances on gfx950
(aiter `test_moe_2stage`). Wrap the op-selectors in
`OpSelA<>`/`OpSelB<>`.

## Changes

- `Jenkinsfile`: pre-compile the FlyDSL MoE AOT cache (`python3
aiter/aot/flydsl/moe.py`) before the AITER tests.
- `include/ck/utility/amd_buffer_addressing_builtins.hpp` and
`include/ck_tile/core/arch/amd_buffer_addressing_builtins.hpp`: drop the
`__clang_major__` guard and always use
`__asm("llvm.amdgcn.raw.buffer.load.lds")`. The plain name is the
canonical one for all sizes including the gfx950 16-byte form, as the
upstream LLVM gfx950 tests confirm.
-
`include/ck_tile/ops/flatmm/pipeline/mixed_prec_flatmm_pipeline_agmem_bgmem_creg_v1.hpp`:
wrap the warp-GEMM op-selectors in `OpSelA<>`/`OpSelB<>` at the five
call sites, matching the `mx_flatmm` pipeline.

## Test plan

Validated via CI.
2026-06-03 02:09:05 +00:00
Illia Silin
5720589311 [rocm-libraries] ROCm/rocm-libraries#7960 (commit ddac5cf)
[CK] Upgrade to new gfx1250 compiler and fix build issues
 (#7960)

## Motivation

The docker image we've been using to build for gfx1250 is a few months
old, so we need to upgrade. Some of the changes in the latest compiler
version require changes in the code. TDM is temporarily disabled due to
changes in the lds load/store intrinsics.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-03 01:58:59 +00:00
Maksim (Max) Podkorytov
d574cc4757 [rocm-libraries] ROCm/rocm-libraries#6696 (commit 9627b91)
Replace nested static_for lambdas with compile-time search
 helper (#6696)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

- Add `sequence_find_value` and `find_in_tuple_of_sequences`
compile-time search helpers with O(1) template depth
- Replace nested `static_for` lambdas in
`TensorDescriptor::GetTransformAndItsUpperDimension` and
`InitializeElementSize`
- Apply same optimizations to `TensorAdaptor`

Supersedes #4287. Conflict-resolved rebase of
ROCm/composable_kernel#3600 onto current develop.

## Motivation

The `TensorDescriptor` and `TensorAdaptor` classes had excessive
template instantiation from:
1. Nested `static_for` loops with lambdas creating unique closure types
at every call site
2. `generate_tuple` with lambdas causing per-type instantiation overhead

The new helpers use constexpr array lookup and pack expansion instead of
recursive template patterns, achieving O(1) template depth.

## Results (`example_grouped_conv_fwd_xdl_fp16`, n=10, interleaved,
`-j1`, `-ftime-trace`)

| TU | Baseline (mean) | New (mean) | Delta | Wilcoxon p | Mann-Whitney
p |

|----|-----------------|------------|-------|-----------|---------------|
| `grouped_conv_fwd_xdl_fp16` (host) | 14,886 ms | 13,353 ms |
**-10.3%** | **0.002** | **0.0002** |
| `grouped_conv_fwd_xdl_fp16` (device) | 27,762 ms | 25,629 ms |
**-7.7%** | **0.002** | **0.0002** |
| **Total (all TUs)** | **57,732 ms** | **54,030 ms** | **-6.4%** | | |

Unrelated TUs (`device_memory`, `host_tensor`, `convolution_parameter`)
show no significant difference (p > 0.3), serving as negative controls.

### Methodology

- 10 interleaved runs (baseline₁, new₁, baseline₂, new₂, ...) on the
same node to eliminate ordering/warmup bias
- Wilcoxon signed-rank test (paired, non-parametric) and Mann-Whitney U
test (unpaired)
- Built with patched clang (LLVM 22) on ctr2-alola-compile-11, `-j1` for
accurate per-TU timing
- Raw data available in Slurm job 275230 results

## Test plan

- [x] 11 unit tests added (5 for `sequence_find_value`, 6 for
`find_in_tuple_of_sequences`)
- [x] Compile-time benchmark with statistical significance (p < 0.01)
- [ ] Full CI

Tracking issue: #4229
2026-06-02 23:15:10 +00:00
Aviral Goel
99ab4c4ef7 [rocm-libraries] ROCm/rocm-libraries#7830 (commit 590fe58)
[CK_Tile][MI450] Add bf16 output wmma instruction (16x16x32)
 (#7830)

Wire __builtin_amdgcn_wmma_bf16_16x16x32_bf16 into CK Tile for gfx1250,
enabling bf16-input bf16-output WMMA at the warp GEMM level.

- Add WmmaTraits specialization for <gfx125_t, bf16, bf16, bf16,
16,16,32>
- Add WarpGemmAttributeWmmaImpl typedef and WarpGemmWmma alias
- Add Dispatcher entry for bf16->bf16 16x16x32
- Add warp_gemm test with reference GEMM validation

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-02 13:54:16 +00:00
Sami Remes
919096fde8 [rocm-libraries] ROCm/rocm-libraries#7935 (commit 5c96097)
[CK] Allow skipping split-K C-buffer zero-init in
 xdl_cshuffle blockscale GEMM (#7935)

Add a `skip_zero_init` flag (default false) to the Problem/Argument of
the xdl_cshuffle block-scale GEMM device ops (multiple_d ab_scale and
blockscale b-preshuffle). When the flag is set, the device invoker skips
the internal hipMemsetAsync that zeroes p_c_grid before the KBatch > 1
split-K atomic-accumulation path. The flag is declared on the gridwise
Problem struct (inherited by Argument), so it is visible on both the
rotating-cache (arg_) and the normal (arg) launch paths in each device
op.

Why: callers that already pre-zero the output buffer otherwise pay for a
redundant device-wide memset before split-K atomic accumulation. Gating
the memset behind an opt-in flag lets such callers avoid the duplicate
work. Because the flag defaults to false, every existing call site is
unaffected and the observable behavior is unchanged.

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-02 13:08:46 +00:00
Johannes Graner
b7c8fb164f [rocm-libraries] ROCm/rocm-libraries#7937 (commit abe276d)
[CK Tile] Add conv Wavelet GEMM pipeline and bwd_weight
 instances (#7937)

## Motivation

CK Tile had no pipeline competitive with old CK's wavelet on the
RetinaNet K=36 C=256 3x3 conv bwd_weight class. This adds a
wave-specialized "wavelet" GEMM pipeline so CK Tile has a competitive
kernel for spatial small-K shapes.

## Technical Details

- New wavelet GEMM pipeline (`gemm_pipeline_ag_bg_cr_wavelet.hpp`):
workgroup split into math waves (LDS read + MFMA) and load waves (DRAM
read + LDS write).
- VGPR role-split: `operator()` has two top-level mutually-exclusive
`is_math` branches so the allocator overlays both roles onto the same
physical VGPRs, cutting arch VGPR ~33-40% and raising occupancy.
Correctness depends on identical `block_sync_lds` counts on both arms
plus a matching load-wave barrier stub in the epilogue
(`cshuffle_epilogue.hpp`).
- Kernel dispatch (`grouped_convolution_backward_weight_kernel.hpp`):
`kIsWavelet` path, `LaunchBlockSize`, load-wave barrier stub.

Uplift: wavelet is the fastest CK Tile pipeline on the RetinaNet K=36
C=256 3x3 family, beating the best non-wavelet CK Tile kernel by 10-27%
(googlenet K=320 by 16-23%); the role-split roughly halves the parity
gap vs old CK on the 13x13 fp16 shape.

## Test Plan

- `ckProfiler grouped_conv_bwd_weight`, NHWGC layout, fp16/bf16,
`split_k=all`, CPU verify on RetinaNet K=36 shapes (7x7, 13x13) and a
broad 2D sweep.
- Correctness: `-v=1` across `split_k` in {-1,1,2,4,8,16,32,64}
(barrier-parity / deadlock check).
- `test_grouped_convnd_bwd_weight` over the tests `.conf` wavelet
instances.

## Test Result

- All wavelet instances CPU-verify correct across the split-K sweep; no
hangs (dual-arm barrier sequence matches).
- Wavelet wins the RetinaNet K=36 C=256 3x3 family (10-27% over best
non-wavelet CK Tile) and googlenet K=320 (16-23%); at parity-or-better
vs old CK on the majority of spatial shapes.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-02 08:51:17 +00:00
Brock Hargreaves
843d993835 [rocm-libraries] ROCm/rocm-libraries#7743 (commit 15ef85c)
[CK] Extract Jenkinsfile helpers into vars/ck.groovy shared
 library (#7743)

## Motivation
The CK Jenkinsfile is a 2,215-line monolith mixing helper function
definitions with pipeline stage declarations. This makes it difficult to
review, modify, or extend CI stages without wading through unrelated
infrastructure code.

## Technical Details
Extract all helper functions from the Jenkinsfile into vars/ck.groovy,
loaded at runtime via ck = load "vars/ck.groovy" in the first stage. The
Jenkinsfile is reduced from 2,215 lines to 810 lines containing only the
pipeline structure.

- 36 helper functions moved to ck.groovy with no logic changes
- 10 new stage-wrapper functions (runBuildCKAndTests,
runTileEngineGemmTests, runClangFormat, etc.) extract inline
environment{}/steps{} business logic from stages, eliminating the
MethodTooLargeException caused by CPS-transformed shell strings
exceeding the JVM 64KB bytecode limit
- All ck. method calls in steps{} blocks wrapped in script{} as required
by Jenkins Declarative Pipeline
- rocmnode() remains in the Jenkinsfile (needed for agent{} labels
before ck is loaded)
- CRON_SETTINGS / POLL_SPEC remain in the Jenkinsfile (triggers{}
evaluates at parse time before any workspace is available)
- No stage names changed

## Test Plan
- Jenkinsfile validated against the Jenkins Pipeline Linter
(/pipeline-model-converter/validate)
- All 35 shared helper functions diffed line-by-line against develop to
verify no regressions
- Merge from develop incorporated and verified (gfx1250 stage, ROCm 7.13
default, cmake_build updates)

## Test Result
- Linter: passes
- Function diff vs develop: all 35 functions match exactly
- Awaiting Jenkins run to confirm end-to-end stage execution

## Submission Checklist

- [ x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-01 21:10:12 +00:00
Chao
c56c6750d0 [rocm-libraries] ROCm/rocm-libraries#6498 (commit 5961a2e)
[CK_TILE] Fix conditional rescale numerical instability in
 FMHA forward (#6498)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

[CK_TILE] Fix conditional rescale numerical instability in FMHA forward

## Motivation

Fix numerical instability in the conditional O-accumulator rescaling
optimization
for CK-Tile FMHA forward (FlashAttention-4, Algorithm 6, Eq. 6).

The conditional rescale optimization skips the expensive O-accumulator
rescale when
the running row-max shift is within a threshold (tau = log2(256) = 8.0).
The original
implementation had a bug: attention weights P were computed in the
`m_new` reference
frame before the skip/rescale decision. In the skip branch, `m` was
reverted to
`m_old`, but P remained in the `m_new` frame, causing incorrect softmax
normalization.

This fix introduces a `p_row_correction` factor: in the skip branch, P
is multiplied
by `exp2(m_new - m_old)` to bring it back to the `m_old` reference
frame.

- **Correctness:** Fixes broken inference on long sequences where
running-max drift
causes exp2 overflow (observed as degraded image quality on MI350X Flux2
generation)
- **Performance:** Neutral to +4% depending on workload shape

## Technical Details

6 pipeline header files (same pattern in each):
- `block_fmha_pipeline_qr_ks_vs.hpp`
- `block_fmha_pipeline_qr_ks_vs_async.hpp`
- `block_fmha_pipeline_qr_ks_vs_async_trload.hpp`
- `block_fmha_pipeline_qr_ks_vs_fp8.hpp`
- `block_fmha_pipeline_qr_ks_vs_whole_k_prefetch.hpp`
- `block_fmha_pipeline_qs_ks_vs.hpp`

In each file:
- Lower threshold from 10.0 to 8.0 (tau = log2(256))
- Add `p_row_correction` distributed tensor initialized to 1.0
- Rescale branch: standard rescale of O_acc and l; correction = 1.0
- Skip branch: compute correction = exp2(-acc_scale_log2), update l,
revert m, store correction
- New `p_spans` sweep applies per-row correction to `p_compute` before
P*V GEMM
- Move P-to-PDataType cast to after correction sweep

## Dependencies

None — this PR is standalone.

## Test Plan

- GPU validation on MI300X (gfx942, ROCm 6.4.1):
- Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128
-prec=bf16 -v=1 -warmup=1 -repeat=3`
- GPU validation on MI350X (gfx950, ROCm 7.0):
- Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128
-prec=bf16 -v=1 -warmup=1 -repeat=3`
- Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128
-prec=fp16 -v=1 -warmup=1 -repeat=3`

## Test Result

Accuracy vs FP32 reference (MI350X, gfx950):

| Shape | max_diff | mean_diff |
|-------|----------|-----------|
| B=1 H=24 M=4096 K=128 bf16 | 9.1e-4 | 4.6e-5 |
| B=4 H=32 M=4096 K=128 bf16 | 9.9e-4 | 4.6e-5 |
| B=1 H=24 M=4096 K=128 fp16 | 1.2e-4 | 9.0e-6 |

Performance (MI350X, gfx950, ROCm 7.0):

| Shape | FA4 (TFlops) | Always-rescale (TFlops) | Delta |
|-------|-------------|------------------------|-------|
| B=1 H=24 M=4096 K=128 bf16 | 425.9 | 428.5 | neutral |
| B=2 H=8 M=2048 K=256 bf16 | 513.9 | 509.0 | +1.0% |
| B=1 H=64 M=2048 K=64 bf16 | 481.7 | 464.3 | +3.7% |

Benchmark results (MI300X, gfx942, ROCm 6.4.1):

No regression on MI300X. This correctness fix is performance-neutral.

| Config | TFlops / GB/s | Time (ms) |
|--------|-------------|-----------|
| MHA bf16 b=2 h=8 s=4096 d=128 | 342.49 TFlops | 0.401 |
| MHA fp16 b=2 h=8 s=4096 d=128 | 391.70 TFlops | 0.351 |
| Causal MHA bf16 b=2 h=8 s=4096 d=128 | 227.07 TFlops | 0.303 |
| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 | 324.69 TFlops | 0.423 |
| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 | 348.09 TFlops | 0.790 |
| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 | 376.71 TFlops |
1.459 |
| Long-seq bf16 b=1 h=16 s=16384 d=128 | 383.42 TFlops | 5.735 |
| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 | 691.64 GB/s | 1.554 |

All validation tests pass (`valid:y`) on both MI300X and MI350X.

Additional validation:
- Uniform scores: softmax output matches FP32 reference (max_diff <
1e-3)
- Large seqlen (4096+): no overflow or NaN in O-accumulator
- Spike pattern: correct handling of sudden row-max jumps
- Multiple spikes: correction applied correctly across multiple
skip/rescale transitions
- Deterministic: identical outputs across repeated runs
- No performance regression on standard workloads
2026-05-30 10:34:06 +00:00
Tianyuan Wu
22a99f97e8 [rocm-libraries] ROCm/rocm-libraries#7677 (commit 308af93)
[CK_Tile] Add scale16 Support for F4 WMMA in CK_Tile

## Motivation
This PR adds CK Tile support for the scale16 F4 WMMA path on gfx1250 and
improves warp GEMM unit test coverage/structure for gfx1250-specific
cases.

## Technical Details

- Scale16 support in warp GEMM dispatch and WMMA trait plumbing: added
IsScale16 plumbing to warp GEMM dispatcher path
- Warp GEMM test restructuring for gfx1250: added Warp GEMM gfx1250
coverage to verify all F4 WMMA paths

## Test Plan
Run ./test_ck_tile_wg_32x16x128_fp4.

## Test Result
```
./test_ck_tile_wg_32x16x128_fp4
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (1751 ms total)
[  PASSED  ] 3 tests.
```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-30 01:28:48 +00:00
Illia Silin
8d97265896 [rocm-libraries] ROCm/rocm-libraries#7863 (commit 0845ce7)
[CK] apply the compiler warning suppression flags in cmake
 files (#7863)

## Motivation

Apply the blanket suppression flags for latest clang warnings in staging
compiler such as:
lifetime-safety-lifetimebound-violation
lifetime-safety-intra-tu-suggestions
lifetime-safety-cross-tu-suggestions
unknown-warning-option

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-30 00:15:12 +00:00
Hosang Yoon
e7e8801dc3 [rocm-libraries] ROCm/rocm-libraries#7586 (commit c18f2c7)
[CK_TILE] Use gfx11 float buffer atomics in FMHA Bwd

## Motivation

FlashAttention CK backward on gfx11 can hit out-of-bounds/tail writes in
the dQ accumulator atomic-add path when sequence rows are padded at the
tile level but not marked invalid in the DQDKDV main tensor view.

With the generic global atomic fallback, an incorrectly-valid tail
element can issue an actual pointer-based `atomicAdd`. With the buffer
atomic path, the write is issued through a buffer resource with bounds
information and follows the same backend already used by gfx9/gfx12.

This fixes the gfx11 FMHA BWD failure without changing the gfx11 default
for unrelated CK Tile kernels.

## Technical Details

This PR enables the existing CK Tile AMD buffer float atomic-add path
only for generated FMHA BWD gfx11 translation units.

gfx11 normally uses the generic global atomic fallback for
floating-point `buffer_view::atomic_add`. That fallback performs the
atomic through a raw computed pointer and depends on the software
validity predicate to avoid invalid elements. In FMHA BWD dQ
accumulation, padded tail rows can reach this path, so using the buffer
atomic backend is safer: it uses a buffer resource with base pointer,
bounds information, and an element offset, matching the backend already
used by gfx9/gfx12.

Enabling `CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT` globally for gfx11 is
too broad and can break unrelated gfx11 CK builds such as GEMM. Instead,
`config.hpp` now preserves an explicitly pre-defined
`CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT`, while keeping the existing
default disabled for gfx11.

## Test Plan

Validated the change with the FlashAttention CK full test suite with
backward pass enabled on gfx11.
pytest -q -s tests/test_flash_attn_ck.py

## Test Result

FlashAttention CK gfx11 test result:
260680 passed, 152076 skipped

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-05-30 00:10:26 +00:00
Emily Martins
95c916369c [rocm-libraries] ROCm/rocm-libraries#7584 (commit 060bad5)
[CK_TILE] Fix Stream-K k_size calculation

## Motivation

In a recent benchmarking task for CK Tile Stream-K algorithm, we
identified that certain instances segfault. This change works to fix the
bug and adds necessary regression tests.

## Technical Details

The StreamK kernel constructs tensor views using a `k_size` parameter
that determines how much of the K dimension to process in each
iteration. Previously, this was calculated as:
 ```cpp
index_t k_size = num_loop_sk * TilePartitioner::KPerBlock;
```
This calculation assumes all macro tiles along K are exactly `KPerBlock` in size. However, when `K % KPerBlock != 0`, the final macro tile along K has a remainder size of `K % KPerBlock`, not a full `KPerBlock` (see the figure below):
<img width="961" height="488" alt="image" src="https://github.com/user-attachments/assets/3e1cceed-5dcd-4980-8b02-cee24eecf262" />
With the old code, a workgroup working with the `MPerBlock x (K % KPerBlock)` tile in A and B risk accessing illegal memory.

Hence, this change ensures that when `K % KPerBlock != 0`, workgroups processing iterations that include the final macro-tile along K calculate the correct `k_size` based on the remainder rather than assuming a full `KPerBlock`.

## Test Plan
I added the following tests:
1. Unit tests added for the Stream-K Tile Partitioner:
- `StreamKTilePartitionerBaseGetKSize/NoRemainderTiles` - validates full tiles
- `StreamKTilePartitionerBaseGetKSize/RemainderTiles` - validates remainder handling
2. Regression tests that test a case where `K % KPerBlock != 0`

## Test Result

Tests passed locally on gfx90a, gfx942, and gfx950.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-29 21:36:49 +00:00
Illia Silin
0edfcf06e5 [rocm-libraries] ROCm/rocm-libraries#7894 (commit 5e66689)
[CK] add credentials to docker manifest inspect call

## Motivation

This should fix an issue that we recently encountered in CI when we
exceeded the limit of accessing docker without authentication:

[2026-05-29T16:08:42.447Z] + docker manifest inspect --insecure
rocm/composable_kernel:ck_ub24.04_rocm7.13
[2026-05-29T16:08:42.833Z] toomanyrequests: You have reached your
unauthenticated pull rate limit.
https://www.docker.com/increase-rate-limit

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-29 19:18:57 +00:00
Aviral Goel
15c904b460 [rocm-libraries] ROCm/rocm-libraries#7724 (commit 4cb149a)
ck_tile: add FillUniformScaleDistribution and fix MX GEMM
 scale init (#7724)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

### Problem
MX GEMM pipeline tests were passing vacuously: scale bytes were drawn
from a fixed range (40–60) which, for e8m0, maps to scales ≈ 10⁻²⁷ — far
below FP16 min denorm. Both GPU and CPU produced all-zero outputs, so
numerical checks passed without exercising the GEMM.

### Changes

**`include/ck_tile/host/fill.hpp`** — new
`FillUniformScaleDistribution<ScaleType>` functor
- Accepts human-readable float bounds and maps them to the raw byte
range of any ExMy scale type (e8m0, e4m3, e5m3) by re-centering the IEEE
754 exponent into the type's bias space
- Sampling is uniform over raw bytes → uniform over representable values
- Fixes left-shift UB: uses multiplication instead of `<< mant_bits` to
avoid shifting negative signed integers (C++17 UB)
- Adds `assert(min_r <= max_r)` to catch inverted-range UB when both
bounds exceed the type's representable range
- Provides default member values (0.125f, 2.0f) and `std::optional` seed
consistent with sibling fillers
- `/** */` Doxygen style with `@note` on snapping asymmetry

**`test/ck_tile/gemm_mx/test_mx_gemm_pipeline_util.hpp`** — fix scale
initialization
- Replace manual byte-range distribution with
`FillUniformScaleDistribution<>{0.125f, 2.0f}`
- Use distinct seeds for scale_a (11941) and scale_b (11943) to avoid
correlated scale tensors that were causing 60 test failures for
fp4+e5m3/e4m3 combinations

**`test/ck_tile/utility/test_fill.cpp`** — new unit tests for
`FillUniformScaleDistribution`
- 16 typed tests across e8m0, e4m3, e5m3: validity, range,
reproducibility, coverage, snapping, stress, nullopt seed, and range
overload
- Test helper `expected_raw_range` mirrors implementation clamping
exactly
2026-05-29 18:45:13 +00:00
Yaswanth Raparti
fe085f8a69 [rocm-libraries] ROCm/rocm-libraries#7761 (commit 237b766)
[CK][CK TILE] Clean up tile_engine grouped_conv harness
 (#7761)

## Motivation
Tile_engine grouped_conv contains ML heuristic validation scripts which
cause confusion to new developers. So, this PR is intended to relocate
the scripts into dispatcher/heuristic directory to maintain separation
of concern.

## Technical Details
The grouped_conv tile_engine directory is a benchmarking harness for
grouped convolution kernels; ML-heuristic content does not belong there.

- Move compare_ml_vs_oracle.py and validate_ml_vs_oracle.py from
tile_engine/ops/grouped_conv/ to
dispatcher/heuristics/validation/grouped_conv/, and rebase their
sys.path / oracle CSV / model dir lookups for the new location (CSV path
is now an --oracle-csv flag instead of a hard-coded sibling).
- Move GROUPED_CONV_HEURISTIC_REPORT.md (system-level ML report) into
dispatcher/heuristics/ where the rest of the heuristic docs live.
- Rewrite tile_engine/ops/grouped_conv/README.md as a pure benchmarking
/ dispatcher-sweep doc (kernel enumeration, JIT pipeline, CSV schema,
problem registry), in the style of tile_engine/ops/fmha/README.md. All
ML training / model-efficiency content is removed and replaced with a
pointer to dispatcher/heuristics/.

## Test Plan

Validation scripts are re-wired and tested locally

## Test Result

Tests passed on local machine.

## Submission Checklist

- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-29 17:09:29 +00:00
Andriy Roshchenko
d5c9215064 [rocm-libraries] ROCm/rocm-libraries#7359 (commit dd62f9f)
[CK_TILE][GFX1250] Enable MX GEMM FLATMM with ASYNC

## Motivation

Enables MX GEMM FLATMM pipeline on gfx1250. The pipeline uses an async
load instruction for tensor A, which complements the existing MX GEMM
FLATMM pipeline with TDM load. At this time, only FLATMM MX pipelines
are enabled on gfx1250.

## Technical Details

The existing gfx950 implementation was extended to support gfx1250
architecture. All three MX FP data types are supported across the two
ASICs.
It should be noted that while the TDM pipeline uses an emulated
32x32x128 warp-tile instruction, the present submission relies on the
built-in 16x16x128 instruction, called 4 times per warp.

## Test Plan

Existing `test/ck_tile/flatmm` tests were extended to cover new gfx1250
functionality.

To help facilitate the testing in development,
`example/ck_tile/18_flatmm/script/smoke_test_mx.sh` script was
introduced to verify various combinations of supported data types and
pipeline versions.

## Test Result

The present submission is expected to work on both gfx950 and gfx1250
hardware for all reasonable sizes and all MX FP8/FP6/FP4 data types.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
- [x] Relies on #6978 and should only be merged after the changes are
merged to the `develop`.
2026-05-29 17:02:45 +00:00
Sami Remes
b619c374eb [rocm-libraries] ROCm/rocm-libraries#5438 (commit 7000562)
[CK_TILE] Normalize gpu_target before LDS_SIZE_MAP lookup
 (#5438)

GPU targets passed with feature suffixes (e.g. `gfx950:xnack+`) were
falling through to `DEFAULT_LDS_SIZE` instead of matching their entry in
`LDS_SIZE_MAP`, potentially causing incorrect tile acceptance/rejection.

## Changes

- **`gemm_validation_utils.py`**: Strip everything after `:` from
`gpu_target` before the `LDS_SIZE_MAP` lookup; use the normalized base
arch name in the error message as well.

```python
# Before
hw_lds_size = LDS_SIZE_MAP.get(gpu_target, DEFAULT_LDS_SIZE)

# After
base_gpu_target = gpu_target.split(":")[0] if gpu_target else gpu_target
hw_lds_size = LDS_SIZE_MAP.get(base_gpu_target, DEFAULT_LDS_SIZE)
```
2026-05-29 16:33:15 +00:00
Illia Silin
8bd8094012 [rocm-libraries] ROCm/rocm-libraries#7833 (commit 8a444cd)
[CK] Replace deprecated load_module function in python
 (#7833)

## Motivation

Recent pytorch builds with python 3.15 failed in CK due to deprecation
of load_module function. This should fix the issue.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-29 15:29:58 +00:00
Bartłomiej Kocot
5d912538d3 [rocm-libraries] ROCm/rocm-libraries#7847 (commit b995ef2)
[CK] Remove IsPackedTensor function

## Motivation

Fix codegen hipRTC

## Technical Details

Remove not needed function. Since MakeArgument supports long_index_t
strides.

## Test Plan

Codegen tests.

## Test Result

Passed.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-29 14:00:06 +00:00
Ville Pietilä
78d657c4f7 [rocm-libraries] ROCm/rocm-libraries#7284 (commit e7d25b2)
[CK_TILE] Integrate CK Tile Dispatcher code generation into
 CK Tile Profiler (#7284)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

CK Tile is going to be delivered to hipDNN via CK Dispatcher. Currently
the CK Tile Profiler using CK Builder for generating the profiled
instances from the configuration files that identify the instances that
old CK exposes. We need to replace this instance generation with the CK
Tile Dispatcher codegen.

## Technical Details
The old CK Profiler config files are converted to JSON files that the CK
Tile Dispatcher can digest. The conversion script for configurations is
stored to source control in case we need to update the JSON
configurations later. The dispatcher generates instance libraries per
conv direction (fwd, bwd data, and bwd weight) that are linked to the CK
Profiler executable. I also implemented codegne for the stream-K and
depthwise conv instances. The proposed solution replaces the CK Builder
codegen with the CK Tile Dispatcher codegen.

There are two new methods that are exposed via the dispatcher backend

- `is_supported` - required to enabled the profiler workflow where we
check the applicability of the kernel instance before running it.
- `get_instance_string` - this mainly for verification. This provide the
CK Builder instance string for verifying that the old CK Builder based
profiler and the new CK Tile Dispatcher based profiler have the same
instances.

The rules that limit the generated instances are now collected to a
single location under the dispacther. The CK Builder codegen uses these,
which ensures that the two codegen pipelines are in sync. The next step
(different PR) is to remove the CK Builder codegen pipeline altogether.

## Test Plan

Verified that the old CK Builder based profiler and the new CK Tile
Dispatcher based profiler have the same instances, that is, the
Dispatcher based codgen can generate the same instances as the old CK
Builder.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-28 21:03:37 +00:00
ltqin
bf07a0150e [rocm-libraries] ROCm/rocm-libraries#7723 (commit 4ed6c51)
[CK Tile] Enable LSE output for fp8bf16 V3 FMHA kernels
 (#7723)

###  Motivation
The V3 pipeline (qr_async_trload_v3) for fp8bf16 FMHA kernels did not
support LSE (Log-Sum-Exp) output. This PR enables LSE output support for
fp8bf16 V3 FMHA kernels, allowing users to retrieve attention statistics
alongside attention outputs.
### Technical Details
    - StandardAttention: lse = softmax_scale * m + log(l)
- LogitsSoftCap: lse = (m / log2(e)) + log(l)

### Test Plan
Run FMHA forward example with fp8bf16 precision and LSE output enabled:
- Test 1: Basic LSE functionality
./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128
-prec=fp8bf16 -init=3 -qscale=1 -lse=1
- Test 2: LSE with LogitsSoftCap (CMakeList should remove Logits filter)
./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128
-prec=fp8bf16 -init=3 -qscale=1 -lse=1 -logits_soft_cap=30.0
2026-05-28 15:58:54 +00:00
dependabot[bot]
c1aee52d3d [rocm-libraries] ROCm/rocm-libraries#7303 (commit 27b6b8c)
Bump urllib3 from 2.6.3 to 2.7.0 in
 /projects/composablekernel/docs/sphinx (#7303)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.6.3 to 2.7.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/urllib3/urllib3/releases">urllib3's
releases</a>.</em></p>
<blockquote>
<h2>2.7.0</h2>
<h2>🚀 urllib3 is fundraising for HTTP/2 support</h2>
<p><a
href="https://sethmlarson.dev/urllib3-is-fundraising-for-http2-support">urllib3
is raising ~$40,000 USD</a> to release HTTP/2 support and ensure
long-term sustainable maintenance of the project after a sharp decline
in financial support. If your company or organization uses Python and
would benefit from HTTP/2 support in Requests, pip, cloud SDKs, and
thousands of other projects <a
href="https://opencollective.com/urllib3">please consider contributing
financially</a> to ensure HTTP/2 support is developed sustainably and
maintained for the long-haul.</p>
<p>Thank you for your support.</p>
<h2>Security</h2>
<p>Addressed high-severity security issues. Impact was limited to
specific use cases detailed in the accompanying advisories; overall user
exposure was estimated to be marginal.</p>
<ul>
<li>
<p>Decompression-bomb safeguards of the streaming API were bypassed:</p>
<ol>
<li>When <code>HTTPResponse.drain_conn()</code> was called after the
response had been read and decompressed partially. (Reported by <a
href="https://github.com/Cycloctane"><code>@​Cycloctane</code></a>)</li>
<li>During the second <code>HTTPResponse.read(amt=N)</code> or
<code>HTTPResponse.stream(amt=N)</code> call when the response was
decompressed using the official <a
href="https://pypi.org/project/brotli/">Brotli</a> library. (Reported by
<a
href="https://github.com/kimkou2024"><code>@​kimkou2024</code></a>)</li>
</ol>
<p>See GHSA-mf9v-mfxr-j63j for details.</p>
</li>
<li>
<p>HTTP pools created using
<code>ProxyManager.connection_from_url</code> did not strip sensitive
headers specified in <code>Retry.remove_headers_on_redirect</code> when
redirecting to a different host. (GHSA-qccp-gfcp-xxvc reported by <a
href="https://github.com/christos-spearbit"><code>@​christos-spearbit</code></a>)</p>
</li>
</ul>
<h2>Deprecations and Removals</h2>
<ul>
<li>Used <code>FutureWarning</code> instead of
<code>DeprecationWarning</code> for better visibility of existing
deprecation notices. Rescheduled the removal of deprecated features to
version 3.0. (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3763">urllib3/urllib3#3763</a>)</li>
<li>Removed support for end-of-life Python 3.9. (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3720">urllib3/urllib3#3720</a>)</li>
<li>Removed support for end-of-life PyPy3.10. (<a
href="https://redirect.github.com/urllib3/urllib3/issues/4979">urllib3/urllib3#4979</a>)</li>
<li>Bumped the minimum supported pyOpenSSL version to 19.0.0. (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3777">urllib3/urllib3#3777</a>)</li>
</ul>
<h2>Bugfixes</h2>
<ul>
<li>Fixed a bug where <code>HTTPResponse.read(amt=None)</code> was
ignoring decompressed data buffered from previous partial reads. (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3636">urllib3/urllib3#3636</a>)</li>
<li>Fixed a bug where <code>HTTPResponse.read()</code> could cache only
part of the response after a partial read when
<code>cache_content=True</code>. (<a
href="https://redirect.github.com/urllib3/urllib3/issues/4967">urllib3/urllib3#4967</a>)</li>
<li>Fixed <code>HTTPResponse.stream()</code> and
<code>HTTPResponse.read_chunked()</code> to handle <code>amt=0</code>.
(<a
href="https://redirect.github.com/urllib3/urllib3/issues/3793">urllib3/urllib3#3793</a>)</li>
<li>Updated <code>_TYPE_BODY</code> type alias to include missing
<code>Iterable[str]</code>, matching the documented and runtime behavior
of chunked request bodies. (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3798">urllib3/urllib3#3798</a>)</li>
<li>Fixed <code>LocationParseError</code> when paths resembling
schemeless URIs were passed to
<code>HTTPConnectionPool.urlopen()</code>. (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3352">urllib3/urllib3#3352</a>)</li>
<li>Fixed <code>BaseHTTPResponse.readinto()</code> type annotation to
accept <code>memoryview</code> in addition to <code>bytearray</code>,
matching the <code>io.RawIOBase.readinto</code> contract and enabling
use with <code>io.BufferedReader</code> without type errors. (<a
href="https://redirect.github.com/urllib3/urllib3/issues/3764">urllib3/urllib3#3764</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/urllib3/urllib3/blob/main/CHANGES.rst">urllib3's
changelog</a>.</em></p>
<blockquote>
<h1>2.7.0 (2026-05-07)</h1>
<h2>Security</h2>
<p>Addressed high-severity security issues.
Impact was limited to specific use cases detailed in the accompanying
advisories; overall user exposure was estimated to be marginal.</p>
<ul>
<li>
<p>Decompression-bomb safeguards of the streaming API were bypassed:</p>
<ol>
<li>When <code>HTTPResponse.drain_conn()</code> was called after the
response had been
read and decompressed partially.</li>
<li>During the second <code>HTTPResponse.read(amt=N)</code> or
<code>HTTPResponse.stream(amt=N)</code> call when the response was
decompressed
using the official <code>Brotli
&lt;https://pypi.org/project/brotli/&gt;</code>__ library.</li>
</ol>
<p>See <code>GHSA-mf9v-mfxr-j63j
&lt;https://github.com/urllib3/urllib3/security/advisories/GHSA-mf9v-mfxr-j63j&gt;</code>__
for details.</p>
</li>
<li>
<p>HTTP pools created using
<code>ProxyManager.connection_from_url</code> did not strip
sensitive headers specified in
<code>Retry.remove_headers_on_redirect</code> when
redirecting to a different host.
(<code>GHSA-qccp-gfcp-xxvc
&lt;https://github.com/urllib3/urllib3/security/advisories/GHSA-qccp-gfcp-xxvc&gt;</code>__)</p>
</li>
</ul>
<h2>Deprecations and Removals</h2>
<ul>
<li>Used <code>FutureWarning</code> instead of
<code>DeprecationWarning</code> for better
visibility of existing deprecation notices. Rescheduled the removal of
deprecated features to version 3.0.
(<code>[#3763](https://github.com/urllib3/urllib3/issues/3763)
&lt;https://github.com/urllib3/urllib3/issues/3763&gt;</code>__)</li>
<li>Removed support for end-of-life Python 3.9.
(<code>[#3720](https://github.com/urllib3/urllib3/issues/3720)
&lt;https://github.com/urllib3/urllib3/issues/3720&gt;</code>__)</li>
<li>Removed support for end-of-life PyPy3.10.
(<code>[#4979](https://github.com/urllib3/urllib3/issues/4979)
&lt;https://github.com/urllib3/urllib3/issues/4979&gt;</code>__)</li>
<li>Bumped the minimum supported pyOpenSSL version to 19.0.0.
(<code>[#3777](https://github.com/urllib3/urllib3/issues/3777)
&lt;https://github.com/urllib3/urllib3/issues/3777&gt;</code>__)</li>
</ul>
<h2>Bugfixes</h2>
<ul>
<li>Fixed a bug where <code>HTTPResponse.read(amt=None)</code> was
ignoring decompressed
data buffered from previous partial reads.
(<code>[#3636](https://github.com/urllib3/urllib3/issues/3636)
&lt;https://github.com/urllib3/urllib3/issues/3636&gt;</code>__)</li>
<li>Fixed a bug where <code>HTTPResponse.read()</code> could cache only
part of the
response after a partial read when <code>cache_content=True</code>.</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="9a950b92d9"><code>9a950b9</code></a>
Release 2.7.0</li>
<li><a
href="5ec0de499b"><code>5ec0de4</code></a>
Merge commit from fork</li>
<li><a
href="2bdcc44d1e"><code>2bdcc44</code></a>
Merge commit from fork</li>
<li><a
href="f45b0df09d"><code>f45b0df</code></a>
Fix a misleading example for <code>ProxyManager</code> (<a
href="https://redirect.github.com/urllib3/urllib3/issues/4970">#4970</a>)</li>
<li><a
href="577193ca02"><code>577193c</code></a>
Switch to nightly PyPy3.11 in CI for now (<a
href="https://redirect.github.com/urllib3/urllib3/issues/4984">#4984</a>)</li>
<li><a
href="e90af45bb0"><code>e90af45</code></a>
Avoid infinite loop in <code>HTTPResponse.read_chunked</code> when
<code>amt=0</code> (<a
href="https://redirect.github.com/urllib3/urllib3/issues/4974">#4974</a>)</li>
<li><a
href="67ed74fdae"><code>67ed74f</code></a>
Bump dev dependencies (<a
href="https://redirect.github.com/urllib3/urllib3/issues/4972">#4972</a>)</li>
<li><a
href="3abd481097"><code>3abd481</code></a>
Upgrade mypy to version 1.20.2 (<a
href="https://redirect.github.com/urllib3/urllib3/issues/4978">#4978</a>)</li>
<li><a
href="2b8725dfca"><code>2b8725d</code></a>
Drop support for EOL PyPy3.10 (<a
href="https://redirect.github.com/urllib3/urllib3/issues/4979">#4979</a>)</li>
<li><a
href="2944b2a0a6"><code>2944b2a</code></a>
Upgrade <code>setup-chrome</code> and <code>setup-firefox</code> to fix
warnings (<a
href="https://redirect.github.com/urllib3/urllib3/issues/4973">#4973</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/urllib3/urllib3/compare/2.6.3...2.7.0">compare
view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=urllib3&package-manager=pip&previous-version=2.6.3&new-version=2.7.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
2026-05-28 15:40:48 +00:00
Illia Silin
016f8891de [rocm-libraries] ROCm/rocm-libraries#7815 (commit e34ac06)
[CK] fix daily build of CK for all supported targets.

## Motivation

Fixing the daily build of CK packages for all supported targets. In the
past, if no GPU_TARGETS was specified, we would by default build CK for
all supported targets, But recently, the MIOpen team requested to change
the default behavior to not build at all if no target is specified (for
the purposes of filtering out unsupported targets in TheRock). So just
adding the explicit list of targets to our daily builds now.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-28 14:52:19 +00:00
Zoltán Lakatos
58e2ab1fc7 [rocm-libraries] ROCm/rocm-libraries#6761 (commit d19f6f1)
[CK] Large tensor gemm workaround (#6761)

## Motivation

Customer qeruested large tensor gemm support for 8bit and 4bit data
types. Currently CK triggers “This GEMM not supported” error. The root
cause appears to be the 2 GB limit on the input/output matrix, triggered
by buffer offset constraints when testing a larger shape such as M =
699,904 (which is an exact multiple of MPerBlock = 256).

## Technical Details

Quick workaround to have support ASAP. Split the tensors into inputs /
outputs smaller than 2GB limit. Iterate on host and call all subproblems
without device code change.
Support is restricted to rowise layout in A, Ds and E

All changes were implemented in DeviceGemm structures to avoid secondory
affect on grouped convolutions.

Got lots of AI generated comments. Addressed the ones that seemed
relevant on the functionality.

## Test Plan

Within CK the following examples can be used with modified input sizes:
example_gemm_multiply_multiply_xdl_fp8
example_gemm_mx_fp4
Tested with Aiter tuning on provided shapes.

## Test Result

All gemms run and provide correct results.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Márton Bidlek <marton.bidlek@streamhpc.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2026-05-27 18:55:15 +00:00
Aviral Goel
6a17f951ea [rocm-libraries] ROCm/rocm-libraries#7714 (commit 13ae6d6)
[CK_TILE] Restructure naive GEMM tutorial and add tile distribution tutorials (#7714)

## Summary
- Flatten naive GEMM tutorial directory structure (remove
`block_level/`, `host_level/`, `warp_level/` subdirs) to match the
composable_kernel repo layout
- Add `CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION` macro switch to toggle
between standard and transposed WarpGemm variants
- Consolidate 6 verbose markdown files (~2600 lines) into one concise
README (~120 lines)
- Add 3 tile distribution encoding tutorials with step-by-step "How to
read Ps/Ys" annotations:
- Tutorial 1: A-matrix DRAM load (256×32) — NDimP=2, coalesced
K-splitting
- Tutorial 2: B-matrix DRAM load (128×32) — same pattern, fewer
iterations
- Tutorial 3: C-matrix register layout (32×32) — MFMA m32n32k8 hardware
output mapping, standard vs transposed
- Tile distribution tutorials guarded to build only for gfx942 and
gfx950
2026-05-27 14:10:29 -04:00