Commit Graph

3197 Commits

Author SHA1 Message Date
Jobbins
9426f49b52 [rocm-libraries] ROCm/rocm-libraries#6064 (commit cce30ab)
[CK] poll develop every 15 minutes for changes
2026-04-01 14:35:42 +00:00
Fu-Cheng Tsai
a502e5a00b [rocm-libraries] ROCm/rocm-libraries#5798 (commit 7acd4e7)
[CK_TILE] Update gfx12 FMHA forward kernel configs
2026-04-01 14:23:38 +00:00
aledudek
119712bd90 [rocm-libraries] ROCm/rocm-libraries#4469 (commit 0844cb0)
[CK_TILE] Add pooling in tile_engine

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
Add pooling in ck tile engine

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-01 07:32:36 +00:00
Yi DING
791afc6465 [rocm-libraries] ROCm/rocm-libraries#5991 (commit 8d85e8e)
[CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR
 misallocation (#5991)

## Motivation

After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard
around the
`num_total_loop <= 0` early-exit check, the IGLP pipeline
(`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV
gradients for
non-masking kernels (even with fix in #5915). Assembly inspection
confirms that the CFG change causes the LLVM
register allocator to reuse AGPR accumulators as scratch destinations in
the dK/dV
reduction loop, breaking the loop-carried accumulation across Q-tile
iterations.

## Technical Details

- Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in
`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing:
it
restores the CFG shape that the register allocator needs to correctly
assign
  dedicated AGPRs to each column of the dK/dV accumulator.
- Only the IGLP pipeline is affected; the other two BWD pipelines do not
exhibit
  this issue.

## Test Plan

## Test Result

## Submission Checklist

- [x] Look over the contributing guidelines at

https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-01 05:45:19 +00:00
Estevan Vedovelli
a33b5be1b9 [rocm-libraries] ROCm/rocm-libraries#6022 (commit 54b284a)
[CK] contraction: extend GetTypeString() to include
 layout-differentiating params (#6022)

## Motivation

Consumers that identify kernels by their `GetTypeString()` (such as
hipTensor's actor-critic kernel selection, which hashes the string into
a
stable cross-platform UID) were silently dropping one of two colliding
variants during registry insertion.

`GetTypeString()` in `DeviceContractionMultipleD_Xdl_CShuffle`
previously
printed 13 template parameters, omitting
`ABlockTransferSrcScalarPerVector`,
`BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and
`BBlockLdsExtraN`.

These four parameters determine the block-transfer access width and LDS
padding strategy, and are precisely what differentiates the `kk`, `kn`,
`mk`, and `mn` layout variants from one another when all other geometry
parameters are equal. Two instantiations with identical 13-parameter
strings
are distinct C++ types that accept different stride layouts and reject
each
other's arguments via `IsSupportedArgument`.

This patch extends the output to 17 parameters so that every distinct
template instantiation of this class produces a unique
`GetTypeString()`.

## Technical Details

`include/ck/tensor_operation/gpu/device/impl/device_contraction_multiple_d_xdl_cshuffle.hpp`:
- extend `GetTypeString()` from 13 to 17 parameters including
`ABlockTransferSrcScalarPerVector`,
`BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and
`BBlockLdsExtraN`.

## Test Plan

Build CK and hipTensor with these changes, and verify hipTensor can
differentiate and select the
correct kernels with layout variations.

## Test Result

CK is building correctly and hipTensor is selecting the kernels
correctly.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-31 15:19:43 +00:00
Bartłomiej Kocot
ef4ff4667d [rocm-libraries] ROCm/rocm-libraries#5842 (commit 04c5690)
[CK][CK Tile] Force padding for atomic_add bf16 C tensor
 (#5842)

## Motivation

Force padding for atomic_add bf16 C tensor to avoid memfaults.

## Technical Details

- add global atomic add for bf16 and enable them
- add padding for atomic add bf16 due to the lack of oob
- remove padding for not continous dims in conv for other cases
- minor bwd data conv fixes

## Test Plan

test_grouped_conv_*_tile

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-31 08:03:41 +00:00
jakpiase
66dc81d530 [rocm-libraries] ROCm/rocm-libraries#5729 (commit 516c974)
[CK_TILE] Changed cshuffle LDS descriptor to naive layout
 (#5729)

## Motivation
This PR changes gemm/convolution cshuffle layout into plain one. to
improve cshuffle operation performance.

## Technical Details
The purpose is that before this change the cshuffle layout was having
some descriptor transformations that were probably aimed at reducing LDS
bank conflicts, but the transformations itself were terribly slow, which
negatively impacted the performance.

## Test Plan
There is no need for additional tests, since current tests cover this
functionality.
2026-03-31 03:40:25 +00:00
Illia Silin
e6b8094f94 [rocm-libraries] ROCm/rocm-libraries#5921 (commit 032ac1b)
[CK] fix clang lifetimebound errors with staging compiler
 (#5921)

## Motivation

The ROCm staging compiler (newer Clang) enforces
`[[clang::lifetimebound]]` annotations on methods that return references
or pointers to internal object data. Without these annotations, the
staging compiler emits compilation errors for container accessor methods
across the CK and CK Tile namespaces.

  ## Technical Details

Adds `[[clang::lifetimebound]]` to all reference/pointer-returning
accessors in core container types:

  **`ck::` namespace:**
  - `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()`
  - `index_array` -- `operator[]`
  - `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()`
  - `IndexLookupTable` -- `operator[]`

  **`ck_tile::` namespace:**
  - `array` -- `get(i)`, `at()`, `operator[]`, `operator()`
  - `static_array` -- `operator[]`
  - `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()`
  - `make_kernel()` -- parameter pack

Also removes the unused `instance_index` variable from
`batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing
  accordingly.

  ## Test Plan

- Compile with the staging compiler to verify all lifetimebound errors
are resolved
- Existing tests pass unchanged -- the attribute is a compile-time
annotation with no runtime effect

 ## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-30 14:20:20 +00:00
Hosang Yoon
2dcae9d173 [rocm-libraries] ROCm/rocm-libraries#5977 (commit 794bea7)
[CK_TILE] Fix Windows build in FMHA head grouping

## Motivation

This is a follow-up fix for [PR
#5018](https://github.com/ROCm/rocm-libraries/pull/5018).

[PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018) added
LLC-aware FMHA head grouping / head-major scheduling on RDNA, but it
also introduced Linux-only code paths, including `<dirent.h>`, which
break Windows builds. This change fixes that by guarding the
Linux-specific LLC probing logic so non-Linux platforms can still build
correctly.

## Technical Details

- Guard `<dirent.h>` with `#ifdef __linux__`
- Guard KFD sysfs traversal logic with `#if defined(__linux__)`
- On non-Linux platforms, return `0` from
`get_kfd_sysfs_llc_cache_bytes()`
- Preserve existing fallback behavior through:
  - `CK_TILE_FMHA_LLC_CACHE_MB`
  - arch-based default LLC sizes
  - no head grouping when no LLC size can be resolved

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-30 14:19:19 +00:00
Jeff Huang
7968368d92 [rocm-libraries] ROCm/rocm-libraries#5918 (commit a7e2c67)
[CK][CK_TILE] Add fp8bf16 hdim=256 tile for batch prefill
 (#5918)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation
FP8 batch prefill kernels currently only support head_dim=128. Models
with head_dim=256 hit the "invalid argument for batch_prefill" error
because no matching kernel variant exists in the codegen dispatch.

## Technical Details
Add a hdim=256 tile size entry for fp8bf16 in the batch prefill codegen
recipe (`fmha_batch_prefill.py`).

Tile configuration: `FmhaFwdTileSize(128, 128, 32, 256, 32, 256, 4,1,1,
4,1,1, 32,32,32, 32,32,32, -1)`
- bm0=128, bn0=128 (Q/K tile sizes)
- bn1=256, bk0max=256 (V head_dim=256)
- Warp MFMA 32x32x32 (fp8 MFMA instructions)

This mirrors the existing bf16/fp16 hdim=256 tile but uses fp8 warp
sizes.

## Test Plan
Tested on both MI308X (gfx942) and MI355X (gfx950) via aiter batch
prefill test with the following matrix:
- page_size: {1, 16, 1024}
- kv_layout: {linear, vectorized}
- lookup_table: {sglang, vllm}
- causal: {true, false}
- logits_soft_cap: {0.0, 30.0}
- contiguous_kv: {true, false}

## Test Result

**MI308X (gfx942):** 160 passed, 32 skipped (page_size=1 + vectorized
not applicable)
**MI355X (gfx950):** 120 passed, 72 skipped (pre-existing ROCm 7.2
compiler issue with causal + no softcap)

No register spills on either platform.

### Profiling — MI355X (gfx950), FP8 pertensor, hdim=256, seqlen=1024, 8
heads

| page_sz | kv_layout | table | causal | soft_cap | time_us | TFLOPS |
|---------|-----------|-------|--------|----------|---------|--------|
| 1 | linear | sglang | False | 0.00 | 55.01 | 156.16 |
| 1 | linear | vllm | False | 0.00 | 55.12 | 155.84 |
| 1 | linear | sglang | False | 30.00 | 62.63 | 137.16 |
| 1 | linear | vllm | False | 30.00 | 62.16 | 138.20 |
| 1 | linear | sglang | True | 30.00 | 64.09 | 67.01 |
| 1 | linear | vllm | True | 30.00 | 63.85 | 67.27 |
| 16 | linear | sglang | False | 0.00 | 57.00 | 150.69 |
| 16 | vectorized | sglang | False | 0.00 | 57.55 | 149.25 |
| 16 | linear | vllm | False | 0.00 | 56.80 | 151.23 |
| 16 | vectorized | vllm | False | 0.00 | 57.32 | 149.87 |
| 16 | linear | sglang | False | 30.00 | 64.77 | 132.62 |
| 16 | vectorized | vllm | False | 30.00 | 63.54 | 135.18 |
| 16 | linear | sglang | True | 30.00 | 66.84 | 64.26 |
| 16 | vectorized | vllm | True | 30.00 | 66.12 | 64.96 |
| 1024 | linear | sglang | False | 0.00 | 58.25 | 147.46 |
| 1024 | vectorized | sglang | False | 0.00 | 57.53 | 149.31 |
| 1024 | linear | vllm | False | 0.00 | 58.06 | 147.94 |
| 1024 | vectorized | vllm | False | 0.00 | 57.55 | 149.27 |
| 1024 | linear | sglang | False | 30.00 | 65.38 | 131.38 |
| 1024 | vectorized | vllm | False | 30.00 | 63.64 | 134.98 |
| 1024 | linear | sglang | True | 30.00 | 66.85 | 64.25 |
| 1024 | vectorized | vllm | True | 30.00 | 65.26 | 65.81 |

### Profiling — MI308X (gfx942), FP8 pertensor, hdim=256, seqlen=1024, 8
heads

| page_sz | kv_layout | table | causal | soft_cap | time_us | TFLOPS |
|---------|-----------|-------|--------|----------|---------|--------|
| 1 | linear | sglang | False | 0.00 | 110.18 | 77.96 |
| 1 | linear | vllm | True | 30.00 | 134.33 | 31.97 |
| 1 | linear | sglang | True | 30.00 | 134.59 | 31.91 |
| 16 | linear | sglang | False | 0.00 | 115.43 | 74.42 |
| 16 | vectorized | sglang | False | 0.00 | 106.11 | 80.95 |
| 16 | linear | vllm | False | 0.00 | 116.34 | 73.83 |
| 16 | vectorized | vllm | False | 0.00 | 106.17 | 80.91 |
| 16 | linear | sglang | False | 30.00 | 135.61 | 63.34 |
| 16 | vectorized | vllm | False | 30.00 | 122.37 | 70.20 |
| 16 | linear | sglang | True | 0.00 | 117.44 | 36.57 |
| 16 | vectorized | vllm | True | 0.00 | 108.81 | 39.47 |
| 16 | linear | sglang | True | 30.00 | 139.43 | 30.80 |
| 16 | vectorized | vllm | True | 30.00 | 125.87 | 34.12 |
| 1024 | linear | sglang | False | 0.00 | 110.65 | 77.63 |
| 1024 | vectorized | sglang | False | 0.00 | 101.70 | 84.46 |
| 1024 | linear | vllm | False | 0.00 | 111.71 | 76.89 |
| 1024 | vectorized | vllm | False | 0.00 | 101.55 | 84.59 |
| 1024 | linear | sglang | False | 30.00 | 129.33 | 66.42 |
| 1024 | vectorized | vllm | False | 30.00 | 120.95 | 71.02 |
| 1024 | linear | sglang | True | 0.00 | 112.26 | 38.26 |
| 1024 | vectorized | vllm | True | 0.00 | 103.02 | 41.69 |
| 1024 | linear | sglang | True | 30.00 | 133.73 | 32.12 |
| 1024 | vectorized | vllm | True | 30.00 | 124.75 | 34.43 |

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-30 10:21:33 +00:00
Yi DING
fb64a4453c [rocm-libraries] ROCm/rocm-libraries#5915 (commit a72cf7d)
[CK_TILE] Fix FMHA BWD register pressure by wrapping
 num_total_loop with amd_wave_read_first_lane (#5915)

## Motivation

In three FMHA backward pipelines, `num_total_loop` is computed without
`amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even
though it is logically uniform across all lanes. This raises register
pressure, and under high pressure the compiler may reuse VGPRs across
overlapping live ranges. This was confirmed via assembly inspection: the
compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an
intermediate value for dV, producing incorrect dK/dV gradients.

## Technical Details

Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three
pipelines:
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr`
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp`
- `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr`

This promotes `num_total_loop` to an SGPR, eliminating the excess
register pressure and the incorrect VGPR reuse.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-30 01:45:16 +00:00
Jan Patrick Lehr
b6bbada9f1 [rocm-libraries] ROCm/rocm-libraries#5639 (commit a65e645)
[CK] More lifetime-warning suppression

## Motivation

The staging compiler picked up another change from upstream that leads
to more lifetime-analysis warnings. This breaks the build, given CK is
built with -Werror. As a result, compiler promotion is blocked.

## Technical Details
This patch adds the pragma push diagnostics to ignore the
lifetime-warnings in the modified files to unblock compiler promotion.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-28 11:20:51 +00:00
Linjun-AMD
3b55a05e71 [rocm-libraries] ROCm/rocm-libraries#5849 (commit d9b89b2)
[CK_TILE ]Revert "[CK_TILE] Enable MXFP6 for MX GEMM op
 (#5095)" (#5849)

This reverts commit 7e55766ddf7e9e20791b0e4e2d7b4026cf16b637.

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 20:37:23 +00:00
Bartłomiej Kocot
c28d0033d7 [rocm-libraries] ROCm/rocm-libraries#5785 (commit d8ecfc1)
[CK] Fix min k_batch calculation in conv kernels

## Motivation

Avoid division by 0 and remove not needed "-1".

## Technical Details

Our div up implementation return lower value if input is divisible.
There is no need to subtract 1.

## Test Plan

test_grouped_conv_bwd_weight

## Test Result

Passed locally.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-1019
2026-03-27 15:38:21 +00:00
Illia Silin
4c926497ad [rocm-libraries] ROCm/rocm-libraries#5829 (commit 19b2813)
[CK] Fix error in dockerfile when building staging compiler.
 (#5829)

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 15:37:21 +00:00
Johannes Graner
58475d3f45 [rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649)
[CK Tile] StreamK support for Bwd Weight grouped convolutions
 (#5393)

## Motivation

Add StreamK work distribution to the CK Tile grouped convolution
backward weight kernel. Split-K divides the K-dimension uniformly across
a fixed `k_batch`, which causes load imbalance when the number of output
tiles doesn't evenly fill the GPU. StreamK distributes total
K-iterations evenly across workgroups, improving utilization on these
shapes.

## Technical Details

StreamK is added as an `if constexpr` branch in the existing kernel,
selected by the `TilePartitioner_` template parameter. Two reduction
strategies are supported:
- **Linear**: tile-starter sequentially accumulates partials from
contributing CTAs
- **Tree**: pairwise binary tree reduction (O(log n) depth, faster for
many contributors)

Both persistent and non-persistent data-parallel (DP) sections are
supported.

Key changes:
- `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution
path with `RunStreamK`/`RunStreamKLoop`, partial store/load via
workspace, flag-based cross-CTA synchronization,
`GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions
- `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers)
and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by
both GEMM and Conv StreamK kernels
- `streamk_gemm_kernel.hpp`: Refactored to use shared helpers
- Merged split-K and StreamK example invokers via `PartitionerPolicy`
template parameter
- StreamK example binary with `--streamk_reduction=linear|tree` and
`--streamk_persistent=0|1`
- CK Builder integration: `SpecifiesStreamK` concept,
`TilePartitionerType` factory helper, `InstanceTraits` with StreamK
fields
- 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP),
negative, builder regression

### Performance (MI355X, gfx950)

Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}):

| Shape | 16x64 tiles | | 128x128 tiles | |
|---|---|---|---|---|
| | Split-K | StreamK | Split-K | StreamK |
| 1x1 128x128 N=32 28x28 | 1.00x | 0.54x | 1.00x | 0.81x |
| 3x3 128x128 N=32 14x14 | 1.00x | 0.59x | 1.00x | 0.62x |
| 1x1 256x64 N=32 56x56 | 1.00x | 0.83x | 1.00x | 1.83x |
| 3x3 512x512 N=2 7x7 | 1.00x | 1.12x | 1.00x | 0.62x |
| 1x1 1024x1024 N=4 7x7 | 1.00x | 1.09x | 1.00x | 0.60x |
| 3x3 128x128 N=32 28x28 | 1.00x | 0.44x | 1.00x | 0.96x |
| 3x3 256x256 N=32 14x14 | 1.00x | 0.67x | 1.00x | 0.93x |
| 3x3 512x512 N=32 7x7 | 1.00x | 0.98x | 1.00x | 1.16x |

StreamK's value depends on tile config: with larger tiles (fewer output
tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up
to 1.16x on typical large-channel convolutions. Tree reduction
consistently outperforms Linear when multiple CTAs contribute to the
same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n)
sequential accumulation. The table reports the best of Linear and Tree
for each shape.

## Test Plan

```bash
ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk
./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk

# Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON)
ninja -C build check-builder
```

30 tests covering:
- Host-side: type traits, kernel args construction, grid size, workspace
size
- GPU end-to-end (Linear + Tree): small/medium shapes, multi-group,
stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher
occupancy
- Persistent DP: Linear + Tree with persistent data-parallel dispatch
- Negative: `IsSupportedArgument` rejects unaligned K and C
- Builder: Create (instance string validation) + Execution (reference
comparison) + instance string regression

## Test Result

All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK
tests pass. Full `check-builder` suite passes. Tolerances computed
dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 09:18:14 +00:00
arai713
36f2ec23f5 [rocm-libraries] ROCm/rocm-libraries#5445 (commit 2cdbf8b)
[CK_TILE] Support for CompV4 pipeline in Stream-K GEMM
 (#5445)

## Motivation
This PR is extending the pipeline support for Stream-K GEMM by adding
the CompV4 pipeline. Additional pipelines will be added in subsequent
PRs.

## Technical Details

- Enable the CompV4 pipeline by adding an option to set DoubleSMemBuffer
to true if the CompV4 pipeline has been selected as it requires double
buffered shared memory
- Addition of CompV4 pipeline into the extended tests: kernel instances
mirror the existing CompV3/Mem configurations (same layout permutations,
data types, and tile sizes) with the pipeline type set to CompV4.
- Addition of CompV4 pipeline into smoke tests (generated using Tile
Engine)

## Test Plan
These were tested using the existing smoke and extended tests.

## Test Result
All tests passed
## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 08:13:27 +00:00
Yi DING
47a04fda08 [rocm-libraries] ROCm/rocm-libraries#5790 (commit c132b5a)
[CK_TILE] Fix NaN for FMHA BWD When seq_q=0

## Motivation
This PR addresses NaNs in the FMHA backward (dQ/dK/dV) path when the
effective query sequence length for a tile is zero, by ensuring the
per-tile pipelines exit early with zeroed accumulators and by avoiding
an early kernel return that prevented writing out cleared gradients.

## Technical Details
- Add unconditional early-exit in the dK/dV pipelines when
`num_total_loop <= 0` (no work), returning zeroed accumulators.
- Adjust group-mode kernel early-return logic to only return when
**both** `seqlen_q` and `seqlen_k` are zero, allowing blocks to run and
store cleared dK/dV when `seqlen_q == 0`.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 07:54:53 +00:00
Yaswanth Raparti
e2470e837a [rocm-libraries] ROCm/rocm-libraries#5880 (commit a6b6c05)
[CK][CK_TILE] Fix CTest parsing to handle all test number
 formats (#5880)

## Motivation
Fix a bug in the smart-build --ctest-only filter that was incorrectly
excluding tests with numbers less than 100.

## Technical Details
The issue was caused by CTest formatting test numbers with variable
spacing based on the number of digits:
  - "Test   `#1`: name (3 spaces for tests 1-9)"
  - "Test  `#79`: name (2 spaces for tests 10-99)"
  - "Test `#100`: name (1 space for tests 100+)"

The previous code used `line.strip().startswith("Test #")` which only
matched tests with a single space (i.e., test numbers >= 100).

This caused tests like ck_tile_unit_sequence (Test #79) to be excluded
from smart-build test selection, resulting in CTest failures when the
binary wasn't built.

Solution: Replace string matching with a regex pattern that handles
all spacing variations: r'^\s*Test\s+#\d+:\s*(.+)$'

## Test Plan
Tested with test numbers from 1 to 12345.

## Test Result
  - Before: 48 tests selected (only tests #100+)
  - After: 146 tests selected (all CTest-registered tests)

## Submission Checklist

- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-03-27 06:34:12 +00:00
Illia Silin
2f98c7bbef [rocm-libraries] ROCm/rocm-libraries#5891 (commit 82563ff)
fix AITER docker setup

## Motivation

Add a new python package required to build AITER.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 04:36:16 +00:00
Bartłomiej Kocot
1c95ce0668 [rocm-libraries] ROCm/rocm-libraries#5856 (commit 2d9a0a1)
[CK] Fix unused param mask

## Motivation

Compiler error caused by unused param mask.

## Technical Details

Skip tests using param mask in test loop.

## Test Plan

Current test improvements.

## Test Result

Passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 03:58:37 +00:00
dependabot[bot]
6215bb8dbc [rocm-libraries] ROCm/rocm-libraries#5896 (commit b7436b5)
Bump requests from 2.32.5 to 2.33.0 in
 /projects/composablekernel/docs/sphinx (#5896)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Bumps [requests](https://github.com/psf/requests) from 2.32.5 to 2.33.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/psf/requests/releases">requests's
releases</a>.</em></p>
<blockquote>
<h2>v2.33.0</h2>
<h2>2.33.0 (2026-03-25)</h2>
<p><strong>Announcements</strong></p>
<ul>
<li>📣 Requests is adding inline types. If you have a typed code base
that uses Requests, please take a look at <a
href="https://redirect.github.com/psf/requests/issues/7271">#7271</a>.
Give it a try, and report any gaps or feedback you may have in the
issue. 📣</li>
</ul>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2026-25645 <code>requests.utils.extract_zipped_paths</code> now
extracts contents to a non-deterministic location to prevent malicious
file replacement. This does not affect default usage of Requests, only
applications calling the utility function directly.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Migrated to a PEP 517 build system using setuptools. (<a
href="https://redirect.github.com/psf/requests/issues/7012">#7012</a>)</li>
</ul>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed an issue where an empty netrc entry could cause malformed
authentication to be applied to Requests on Python 3.11+. (<a
href="https://redirect.github.com/psf/requests/issues/7205">#7205</a>)</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Dropped support for Python 3.9 following its end of support. (<a
href="https://redirect.github.com/psf/requests/issues/7196">#7196</a>)</li>
</ul>
<p><strong>Documentation</strong></p>
<ul>
<li>Various typo fixes and doc improvements.</li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/M0d3v1"><code>@​M0d3v1</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/6865">psf/requests#6865</a></li>
<li><a href="https://github.com/aminvakil"><code>@​aminvakil</code></a>
made their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/7220">psf/requests#7220</a></li>
<li><a href="https://github.com/E8Price"><code>@​E8Price</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/6960">psf/requests#6960</a></li>
<li><a href="https://github.com/mitre88"><code>@​mitre88</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/7244">psf/requests#7244</a></li>
<li><a href="https://github.com/magsen"><code>@​magsen</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/6553">psf/requests#6553</a></li>
<li><a
href="https://github.com/Rohan5commit"><code>@​Rohan5commit</code></a>
made their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/7227">psf/requests#7227</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/psf/requests/blob/main/HISTORY.md#2330-2026-03-25">https://github.com/psf/requests/blob/main/HISTORY.md#2330-2026-03-25</a></p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/psf/requests/blob/main/HISTORY.md">requests's
changelog</a>.</em></p>
<blockquote>
<h2>2.33.0 (2026-03-25)</h2>
<p><strong>Announcements</strong></p>
<ul>
<li>📣 Requests is adding inline types. If you have a typed code base
that
uses Requests, please take a look at <a
href="https://redirect.github.com/psf/requests/issues/7271">#7271</a>.
Give it a try, and report
any gaps or feedback you may have in the issue. 📣</li>
</ul>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2026-25645 <code>requests.utils.extract_zipped_paths</code> now
extracts
contents to a non-deterministic location to prevent malicious file
replacement. This does not affect default usage of Requests, only
applications calling the utility function directly.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Migrated to a PEP 517 build system using setuptools. (<a
href="https://redirect.github.com/psf/requests/issues/7012">#7012</a>)</li>
</ul>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed an issue where an empty netrc entry could cause
malformed authentication to be applied to Requests on
Python 3.11+. (<a
href="https://redirect.github.com/psf/requests/issues/7205">#7205</a>)</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Dropped support for Python 3.9 following its end of support. (<a
href="https://redirect.github.com/psf/requests/issues/7196">#7196</a>)</li>
</ul>
<p><strong>Documentation</strong></p>
<ul>
<li>Various typo fixes and doc improvements.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="bc04dfd6da"><code>bc04dfd</code></a>
v2.33.0</li>
<li><a
href="66d21cb07b"><code>66d21cb</code></a>
Merge commit from fork</li>
<li><a
href="8b9bc8fc0f"><code>8b9bc8f</code></a>
Move badges to top of README (<a
href="https://redirect.github.com/psf/requests/issues/7293">#7293</a>)</li>
<li><a
href="e331a288f3"><code>e331a28</code></a>
Remove unused extraction call (<a
href="https://redirect.github.com/psf/requests/issues/7292">#7292</a>)</li>
<li><a
href="753fd08c5e"><code>753fd08</code></a>
docs: fix FAQ grammar in httplib2 example</li>
<li><a
href="774a0b837a"><code>774a0b8</code></a>
docs(socks): same block as other sections</li>
<li><a
href="9c72a41bec"><code>9c72a41</code></a>
Bump github/codeql-action from 4.33.0 to 4.34.1</li>
<li><a
href="ebf7190679"><code>ebf7190</code></a>
Bump github/codeql-action from 4.32.0 to 4.33.0</li>
<li><a
href="0e4ae38f0c"><code>0e4ae38</code></a>
docs: exclude Response.is_permanent_redirect from API docs (<a
href="https://redirect.github.com/psf/requests/issues/7244">#7244</a>)</li>
<li><a
href="d568f47278"><code>d568f47</code></a>
docs: clarify Quickstart POST example (<a
href="https://redirect.github.com/psf/requests/issues/6960">#6960</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/psf/requests/compare/v2.32.5...v2.33.0">compare
view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=requests&package-manager=pip&previous-version=2.32.5&new-version=2.33.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
2026-03-26 22:01:37 +00:00
joyeamd
046d3ac274 [rocm-libraries] ROCm/rocm-libraries#5789 (commit 6654ca6)
[CK][CK_TILE] Revert addional oob check in gemm IsSupported
 function (#5789)

## Motivation

fix ck_tile's oob check.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-26 01:41:35 +00:00
Estevan Vedovelli
0004a37de5 [rocm-libraries] ROCm/rocm-libraries#5675 (commit fbd7fa7)
[CK] Properly build HIPTENSOR_REQ_LIBS_ONLY targets when used
 in addition to MIOPEN_REQ_LIBS_ONLY (#5675)

## Motivation

When building CK with both -DHIPTENSOR_REQ_LIBS_ONLY=ON and
-DMIOPEN_REQ_LIBS_ONLY=ON, only MIOpen targets were being properly
installed. This change is necessary to allow hipTensor to build with
TheRock without the need to rebuild CK from source.

## Technical Details

The solutions consists in considering both HIPTENSOR_REQ_LIBS_ONLY and
MIOPEN_REQ_LIBS_ONLY when including hiptensor's targets in CMake,
following the same approach used to the conv target (for MIOpen).

## Test Plan

Manually test the build and installation with
`-DHIPTENSOR_REQ_LIBS_ONLY=ON` and both `-DHIPTENSOR_REQ_LIBS_ONLY=ON
-DMIOPEN_REQ_LIBS_ONLY=ON`, and verify that the proper files as
installed.

## Test Result

The build with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` properly includes the
targets contraction, reduction and other, while
`-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON` includes conv,
contraction, reduction and other.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-25 23:59:53 +00:00
Illia Silin
86ec92f925 [rocm-libraries] ROCm/rocm-libraries#5571 (commit 8f60932)
[CK] fix clang lifetime bound error in ck_builder.

## Motivation

This resolves the compilation error with latest develop compiler branch.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-25 16:45:38 +00:00
Illia Silin
bee61860c2 [rocm-libraries] ROCm/rocm-libraries#5764 (commit f3c1232)
Re-enable daily builds with staging compiler

## Motivation

This should help us catch and fix any new compilation issues early on.

## Technical Details

We now have three compiler profiles:
* **develop**: slightly stabilized version of amd-staging with some of
the obvious offending PRs reverted, 1-2 weeks behind amd-staging;
* **amd-mainline**: more stable version of compiler, the baseline for
all other branches, e.g., release, npi, etc. 2-4 weeks behind
amd-staging.
* **amd-staging**: latest compiler version where all new PRs land, often
broken;

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: kensclin <lshyhchy@amd.com>
2026-03-25 16:37:58 +00:00
Ville Pietilä
ec2dbfbfde [rocm-libraries] ROCm/rocm-libraries#5516 (commit ff3afda)
[CK_TILE, CK_BUILDER] Add bwd data to CK Tile profiler
 (#5516)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

We want close the performance gap between old CK and CK Tile for bwd
data convolutions. To achieve this, we need tow things

- Configurations for the old CK kernel instances such that we can map
them into CK Tile instances.
- Support in CK profiler to run the CK Tile instance with the same API
as for old CK instances.

## Technical Details

Extracted kernel configurations from old CK. The codegen python script
for CK Tile convs is extended to support also bwd data. The generated
instances are added to the CMake build (target
`device_grouped_conv_bwd_data_tile_instances`).
A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the
CK Profiler. The API is same as for old CK's profiler op
`grouped_conv_bwd_data`.
2026-03-25 14:36:11 +00:00
joyeamd
1834e318da [rocm-libraries] ROCm/rocm-libraries#5697 (commit dd1c396)
Revert "Ck/joye/revert oob check (#5640)"

This reverts commit 552ab4880292694cb8261f40fa4223af52cb8419.

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-23 22:05:30 +00:00
andrew clark
5a4243096b [rocm-libraries] ROCm/rocm-libraries#5713 (commit e179279)
Adding New Notification Detection

## Motivation

Restricting one of the notification failure patterns to match a specific
missing drivers log pattern. This will help reduce the noise of
erroneous logs. Also adding a new failure pattern to notify us of Github
access issues.

## Technical Details

- Set the failure pattern to match the exact failure observed in the
logs.
- Switching to a plain substring search so special characters are
handled literally.
- Added a new failure pattern for Github access errors.

## Test Plan

- Force a failure using the known failure patterns.

## Test Result

The forced failures were triggered and caught by the notification
system.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-23 20:57:55 +00:00
Eiden Yoshida
ba2fb0224f [rocm-libraries] ROCm/rocm-libraries#5691 (commit 2fbb1fc)
[CK] MICI: Revert "add self healing to ref repo"

The check may not be working as intended, causing premature deletion of
reference repositories
2026-03-23 14:16:53 +00:00
Bartłomiej Kocot
f79926009b [rocm-libraries] ROCm/rocm-libraries#5555 (commit 1d2c4c8)
[CK][CK Tile] Fix kbatch check in grouped conv and gemm
 kernels (#5555)

## Motivation

Fix kbatch check in grouped conv and gemm kernels, allow tails for
kbatch.

## Technical Details

Round up K / Kperxdl and divide it by Kbatch to allow tail for K.

## Test Plan

test_grouped_convnd_bwd_weight_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-21 22:56:19 +00:00
Emily Martins
6b69ac9676 [rocm-libraries] ROCm/rocm-libraries#5625 (commit 7d2ed43)
[CK_TILE] Prune Stream-K Tile Engine Tests

## Motivation

Stream-K tile engine tests are causing issues for build time. While we
work on a more permanent solution, these changes prune the Stream-K test
instances to help reduce the build time burden.

## Technical Details

The Stream-K team recently transitioned to using CK Tile's tile engine
infrastructure for our smoke tests. However, since tile engine creates
an individual target per kernel instance, we've found that the tile
engine tests are increasing build times. Our team is currently working
to convert our existing tile engine tests back to basic gtests. While
this work takes place, we are temporarily pruning the existing Stream-K
tile engine test instances to help reduce the build time burden.

## Test Plan

Ran the pruned test set on all gfx90a, gfx942, and gfx950.

## Test Result

All tests pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 20:31:39 +00:00
andrew clark
a66047ad09 [rocm-libraries] ROCm/rocm-libraries#5464 (commit debfc96)
Improved CI infrastructure failure detection

## Motivation

This PR re-enables CI infrastructure failure detection and notification,
which had been disabled due to performance issues caused by loading
large build logs (~80k lines) into memory for pattern scanning. The goal
is to reliably detect known infrastructure failures (GPU errors, Docker
authentication issues, disk space errors, etc.) and send actionable
Teams notifications without hanging on large logs.

## Technical Details

- Replaced full build log loading and Groovy-based pattern scanning with
a streaming wget | grep -E pipe. grep scans natively so the full log is
never loaded into Groovy, resolving the hang on large logs.
- Combined all failure patterns into a single grep -E call to avoid
multiple log fetches.
- The node name is now tracked with the observed failure.
- Added a new failure pattern for device's running out of space.

## Test Plan

- Forced failures in the "Determine CI Execution" stage with all 9
failure patterns echoed to the build log.
- Simulated large log sizes (~80k lines of dummy output) to validate
pattern detection and node name extraction at realistic log scales,
including patterns placed both before and after large blocks of dummy
output.

## Test Result

All 9 failure patterns detected correctly. Teams notifications sent with
accurate log context, node name, and job links. No hangs observed on 80k
line simulated logs.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 19:18:07 +00:00
Jobbins
e8f57c0159 [rocm-libraries] ROCm/rocm-libraries#5630 (commit 14cd617)
add self healing to ref repo

## Motivation

Check for when mirror repo gets corrupted in CI

## Technical Details

We detect broken ref objects and rebuild the local mirror in that case
of corruption

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 16:43:37 +00:00
Bartłomiej Kocot
db40d3f517 [rocm-libraries] ROCm/rocm-libraries#5334 (commit bb5a3c8)
[CK][CK Tile] Improve access for merged groups and remove
 modulo from xor (#5334)

## Motivation

[CK][CK Tile] Improve access for merged groups and remove modulo from
xor

## Technical Details

- add template parameter to xor if modulo is needed. We don't need
modulo for merged groups
- use access by m for merged groups for a tensor
-
## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 15:47:22 +00:00
Bartłomiej Kocot
fd8714aea9 [rocm-libraries] ROCm/rocm-libraries#5609 (commit 95afb2c)
[CK][CK Tile] Move grouped conv cpp instances to build dir
 (#5609)

## Motivation

Move grouped conv .cpp instances to build dir. Fix generate instances
script.

## Technical Details

Avoid CI problem when instances in experimental directory are not
removed

## Test Plan

test_grouped_convnd_*_tile

## Test Result

Pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 13:20:48 +00:00
joyeamd
a22c822aef [rocm-libraries] ROCm/rocm-libraries#5640 (commit 552ab48)
Ck/joye/revert oob check

## Motivation

fix ck_tile's oob check.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 12:31:27 +00:00
arai713
da863dae1b [rocm-libraries] ROCm/rocm-libraries#4795 (commit 6590a1a)
[CK_TILE] Rename Stream-K grid function

## Motivation
This PR introduces a change in the name of the get_grid function in the
Stream-K TilePartitioner to avoid confusion with a similarly named
method. In the Stream-K TilePartitioner, there is get_grid() which
returns num_cu*occupancy and there is grid_size() which returns the grid
size used to launch the kernel. In this PR, we change get_grid() to be
get_max_active_wgs() to better reflect what the function returns and not
confuse it with grid_size().

## Technical Details
Initially in the Stream-K TilePartitioner we had get_grid() which
returned grid_. We are renaming get_grid() to get_max_active_wgs() and
grid_ to max_active_wgs_ internally, while keeping grid_size() the same.
The parameter, grid, for the Stream-K TilePartitioner remains the same
to maintain consistency with the rest of the Stream-K API.

## Test Plan
Validated using the test suite that is already present.

## Test Result
All tests passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 09:28:47 +00:00
yinglu
a268a2a2e1 [rocm-libraries] ROCm/rocm-libraries#5612 (commit 38c9498)
[CK]fix: remove redundant structured sparsity check in
 run_gemm_example.inc (#5612)

## Motivation

This issue if found via
https://github.com/ROCm/rocm-libraries/pull/4302#discussion_r2958603418
and is introduced via https://github.com/ROCm/rocm-libraries/pull/5323.

## Technical Details

The outer `if` and inner `if constexpr` both checked
GemmConfig::UseStructuredSparsity. Merged into a single `if constexpr`
since both preshuffle and UseStructuredSparsity are compile-time
constants.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 08:23:07 +00:00
Sami Remes
d7c761e060 [rocm-libraries] ROCm/rocm-libraries#5095 (commit 7e55766)
[CK_TILE] Enable MXFP6 for MX GEMM op

## Motivation

Add support for MXFP6 in the MX GEMM op in CK-Tile.

Depends on https://github.com/ROCm/rocm-libraries/pull/4594

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 01:08:52 +00:00
Yaswanth Raparti
a5d0200ccf [rocm-libraries] ROCm/rocm-libraries#5614 (commit 32933df)
[CK][CK TILE] Fix smart-build to run install target for
 client examples (#5614)

How ninja install works:
- Builds library dependencies (device_operations, etc.)
- Installs them to CMAKE_INSTALL_PREFIX
- Skips building test executables (not install dependencies)

Affected stages (8):
- gfx942/gfx950/gfx908/gfx90a CK Client Examples
- gfx10-1/gfx10-3/gfx11/gfx12 CK Client Examples

## Motivation

Problem: When smart-build is enabled (runAllUnitTests=false), the build
step is skipped entirely. This causes client example stages to fail
because they depend on the CK library being installed to ../install.

Error seen:
  Target "client_gemm" links to:
    composable_kernel::device_other_operations
  but the target was not found.

## Technical Details
Root cause: Line 712 only checked runAllUnitTests, so when building with
config_targets="install", the install target was never built, leaving
the install directory empty.

Fix: Added condition to always build when config_targets contains
'install'. The install target automatically builds its dependencies (the
CK libraries) but skips building tests, which aligns with smart-build
philosophy.

## Test Plan

Should be tested on CI

## Test Result

Should be tested on CI

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-19 22:02:03 +00:00
yinglu
d460ab35b6 [rocm-libraries] ROCm/rocm-libraries#4302 (commit e62bd8a)
[CK_TILE] add tf32 support
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in
CK_TILE on gfx942 and gfx950.

## Checklist

Please put an into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run  on all changed files
- [ ] Any dependent changes have been merged

## Discussion
2026-03-19 09:19:06 +00:00
Yaswanth Raparti
652d3456ca [rocm-libraries] ROCm/rocm-libraries#5249 (commit 2a114bb)
[CK] [CK_TILE] Improve build and test time of CI with smart
 dependency parser (#5249)

## Motivation

Existing dependency parser needs full build of tests to determine which
tests are affected by code changes in a PR. This still takes 2-4 hours
for building the tests which slows down the CI as the number of tests
grow. To resolve this issue we implemented a smart dependency parser
which uses CMake Configure to parse dependencies and build only the
affected test cases. We have ensured that two approaches are available
1) CMake pre-build analysis for each PR to ensure fast build and test.
2) Ninja post-build analysis to enable full build for nightly tests.

## Technical Details

```bash
### 1. Configure the project with CMake
cmake -G Ninja -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ..

### 2. Analyze dependencies (no build required!)
python3 ../script/dependency-parser/main.py cmake-parse compile_commands.json build.ninja \
  --workspace-root .. --output cmake_dependency_mapping.json --parallel 8

### 3. Find tests affected by changes
python3 ../script/dependency-parser/main.py select cmake_dependency_mapping.json origin/develop \
  HEAD --test-prefix --output tests_to_run.json

### 4. Build only affected tests
ninja $(jq -r '.executables[]' tests_to_run.json | tr '\n' ' ')

### 5. Run affected tests
ctest -R "$(jq -r '.regex' tests_to_run.json)"
```

### Jenkins Integration
- Added `buildMode` to jenkinsfile to integrate both `selective` and
`full` build methods

### Known Limitations

### 1. Build-Time Generated Headers (HIGH RISK)

**Problem:** Files generated during the build process (e.g., via
`add_custom_command`) cannot be analyzed before building.

**Example:**
```cmake
add_custom_command(
  OUTPUT ${CMAKE_BINARY_DIR}/generated/config.hpp
  COMMAND generate_config.sh
  DEPENDS template.hpp.in
)
```

**Impact:** If a source file includes `generated/config.hpp`, the
dependency won't be detected until after building.

**Mitigation:**
- CK analysis shows **no generated headers** currently used
- If generated headers are added in the future, they must be built first
- Recommendation: Generate headers in CMake configure phase (not build
phase) when possible

## Test Plan
**1. Modified Files:**
```
include/ck_tile/ops/common.hpp
include/ck_tile/ops/gemm.hpp
include/ck_tile/ops/gemm/warp/warp_gemm.hpp
```
**2. Compare tests selected between `build.ninja` and `cmake-parse`
methods**

## Test Result
- 1. The test completed in 5-6 minutes finding about 8000+ executables
that should be built.
- 2. We selected a commit 5ccc1387ea which resulted in same 7 tests with
both legacy and new methods.
-

PR | Legacy tests | Smart tests | Notes
-- | -- | -- | --
5261 | 453 | 455 | Only 2 tests (test_amdgcn_mma and
test_amdgcn_sparse_mma)
5168 | 0 | 0 | Changes in   dispatcher only. No CK tests invoked.
5249 | 0 | 0 | Changes to   dependency parser. No CK tests invoked
5260 | 0 | 0 | Changes in   dispatcher only. No CK tests invoked.
5174 | 1 | 1 | One test from FMHA   affected by this PR in both cases
5383 | 0 | 0 | Changes are only in benchmark files. Did not trigger any
tests
5445 | 1 | 1 | Changes are only to tests/ck_tile/gemm_streamk. Only
triggered one streamk test in both cases.
5454 | 3 | 3 | Both methods identified same test_grouped_conv_bwd tests
5427 | 234 | 234 | Core infrastructure header changes. Detected exactly
same tests
5388 | 85 | 85 | modifies warp-level GEMM operations (warp_gemm.hpp,
warp_gemm_dispatcher.hpp). Correctly identified all the streamK gemm
tests

## Submission Checklist

- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-19 05:31:35 +00:00
lalala-sh
345a56c55e [rocm-libraries] ROCm/rocm-libraries#5086 (commit f4880d7)
[CK] Fix MOE FP8 SplitK buffer descriptor OOB

When SplitK is enabled, kernel entry shifts A/B/AScale/BScale base
pointers by SplitKBatchOffset, but make_dynamic_buffer element spaces
are still based on full K dimension. This causes hardware buffer
resource descriptors to extend beyond the actual tensor allocation,
leading to GPU memory access faults when the tensor happens to be placed
at the end of an allocated memory pool region.

Fix by subtracting the split offset from each buffer's element space in
both Run() (v1 pipeline) and Run_2Lds() (v2/v3 pipeline), so the buffer
descriptor range [shifted_base, shifted_base + reduced_space) exactly
covers the valid allocation.

Also refactor SplitKBatchOffset to accept const Problem& (instead of
Argument&) and add a default constructor, enabling direct reuse in
Run/Run_2Lds without duplicating offset calculation logic.

Made-with: Cursor

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-19 02:43:30 +00:00
Christopher Millette
e5683e2290 [rocm-libraries] ROCm/rocm-libraries#5031 (commit 1d86a92)
[CK] Replace nested static_for with static_ford to reduce
 device IR function emissions [1B] (#5031)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

### Rationale
CK's GPU kernels are among the slowest files in the ROCm build, with a
single translation
unit taking up to 10+ minutes. Profiling with `-ftime-trace` identified
nested `static_for`
loops as the root cause: each nesting level multiplies the number of
unique lambda IR
functions the compiler must process. A 2-level nest of `static_for<0, M,
1>` /
`static_for<0, N, 1>` produces M×N unique lambda types. With typical
GEMM dimensions
(M=16, N=4), a single nest generates 64 unique functions — and these
nests appear hundreds
of times across the codebase.

The LLVM backend's CGSCC (Call Graph Strongly Connected Components)
framework processes
each function independently, so reducing function count directly reduces
backend time.

### What changed
393 nested compile-time loop patterns across 73 files are converted to
`static_ford`, which
flattens multi-dimensional compile-time iteration into a single
`static_for` with index
decomposition. This eliminates 994 `static_for` nesting levels (42%
reduction).

Three pattern categories were converted:
- **Category A**: `static_for` wrapping `static_ford` — fold outer
dimension into ford
- **Category B**: nested `static_ford` — merge into single
higher-dimensional ford
- **Category C**: nested `static_for` chains — convert to single
`static_ford`

### Verification

**ASM equivalence: PASS — 51/51 device assembly files identical (gfx942
+ gfx1100)**

| Architecture | Files compared | Largest file | Result |
|---|---|---|---|
| gfx942 | 36 | 386,685 lines | ALL MATCH |
| gfx1100 | 15 | 47,769 lines | ALL MATCH |

**Build time (Wilcoxon signed-rank test, 7 paired trials):**

| Target | Pre (s) | Post (s) | Delta | p-value |
|---|---|---|---|---|
| bscale | 169 | 152 | **-9.8%** | 0.016 \* |
| xdl_v1234 | 207 | 194 | **-6.6%** | 0.016 \* |
| preshuffle | 275 | 264 | **-3.9%** | 0.016 \* |
| xdl_base | 142 | 137 | **-3.2%** | 0.031 \* |

**IR function counts (device backend, gfx942):**

| Target | InstFunc Δ | CodeGen Δ | Compiler Δ |
|---|---|---|---|
| bscale | -13,043 (-8.2%) | -2,103 (-3.5%) | -10.7% |
| xdl_v1234 | -9,431 (-5.7%) | +59 (+0.1%) | -5.2% |
| xdl_base | -6,162 (-4.9%) | -1,141 (-2.5%) | -2.2% |
| xdl_old | -3,234 (-3.7%) | -963 (-8.7%) | -3.3% |

### Value
- **994 fewer `static_for` nesting levels** (-42%) across 73 files
- **393 `static_ford` sites** created (from 4 pre-existing)
- **Up to 9.8% compile-time reduction** on representative targets
(statistically significant, p < 0.05)
- **Up to 13K fewer IR function instantiations** per translation unit
- Net -849 LOC from reduced indentation
- **Zero ASM changes** — identical device code output verified on gfx942
and gfx1100
- All scheduling barriers, `if constexpr` guards, and MFMA/WMMA
accumulation order preserved

### Files changed (73)
- `block/`: 47 files (GEMM pipelines — xdlops, wmma, moe, preshuffle,
blockscale variants)
- `grid/`: 20 files (softmax, normalization, reduction, attention,
layernorm)
- `thread/`: 5 files (tensor slice transfer, contraction, GEMM dlops,
reduction)
- `tensor_description/`: 1 file (tensor_adaptor)

## Test plan
- [x] `static_ford` tested with 21 unit tests in
`test/util/unit_ford.cpp`
  (1D-4D, custom orders, compile-time verification)
- [x] All conversions preserve iteration order, `block_sync_lds()`
placement,
  `if constexpr` scheduling guards, and MFMA/WMMA accumulation order
- [x] ASM equivalence verified: 51 device `.s` files across gfx942 +
gfx1100
- [x] Build-time improvement statistically confirmed (Wilcoxon, p <
0.05, 4 targets)
- [x] IR function count reduction confirmed via `-ftime-trace` on 7
targets
- [x] Detection script reports 0 remaining safe patterns (180 blocked
with structural reasons)
- [x] Existing CI tests (GEMM, softmax, normalization, batch norm,
reduction,
  attention) exercise all converted code paths

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-18 14:46:50 +00:00
Thomas Ning
5f90f69795 [rocm-libraries] ROCm/rocm-libraries#5323 (commit 5454e9e)
CK Tile MX GEMM Packing Improvement

## Motivation

Reduce the scale loading size and also has better utilization of MFMA
scale selection.

## Technical Details

Add up the packing of mx scales.

## Test Plan

Use the existing test cases.

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-17 18:58:56 +00:00
Hosang
859acb5ae7 [rocm-libraries] ROCm/rocm-libraries#5018 (commit b32e7e6)
[CK_TILE] Add LLC-aware FMHA head grouping and head-major
 scheduling on RDNA (#5018)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation
Long-sequence FMHA can become memory-bound when K/V working sets exceed
Infinity Cache (LLC), causing repeated DRAM traffic across heads.

This PR introduces LLC-aware launch ordering improvements for FMHA
forward, and it is currently enabled only on gfx11 and gfx12. The
approach is inspired by
[`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217),
adapted to CK’s kernel/runner structure and layout handling.

In this context, `bshd` is the layout used in Flash-Attention, while
`bhsd` is the default layout used by the CK Tile FMHA example.

## Technical Details
This PR adds two complementary strategies:

- For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware
head grouping:
  - Estimate LLC size (env override, KFD sysfs, or arch default).
  - Compute group size from K/V bytes per head vs LLC target.
- Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and
related tensors).

- For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit
launch-order adjustment:
  - Keep a single kernel launch.
- Reinterpret block linearization in `GetTileIndex` to make execution
head-major,
     improving temporal locality of per-head K/V reuse.

Additional integration updates:
- Propagate `num_head_q_total` and `head_start` through FMHA args/kargs.
- Use global head indexing for dropout RNG stream mapping so grouped
launches keep
    deterministic/consistent dropout behavior.
- Keep fallback behavior unchanged when grouping is not beneficial or
disabled.

## Test Plan
- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`

## Test Result
- `test_ck_tile_fmha`: all tests passed.
- `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201,
and all of them show higher performance compared to the baseline. The
improvement is consistent, and performance is well maintained even at
long sequence lengths.

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128
-s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
- TFLOPs by sequence length target: gfx1100 layout: bhsd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 56.27 | 61.48 | 1.09x
4096 | 67.10 | 72.27 | 1.08x
8192 | 65.99 | 71.64 | 1.09x
12288 | 61.60 | 76.61 | 1.24x
16384 | 58.99 | 75.74 | 1.28x
20480 | 57.32 | 74.42 | 1.30x
24576 | 56.89 | 74.25 | 1.31x
27280 | 18.93 | 24.48 | 1.29x

- TFLOPs by sequence length target: gfx1201 layout: bshd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 66.79 | 65.90 | 0.99x
4096 | 85.90 | 86.80 | 1.01x
8192 | 77.06 | 90.29 | 1.17x
12288 | 58.36 | 88.98 | 1.52x
16384 | 52.12 | 88.88 | 1.71x
20480 | 48.11 | 88.42 | 1.84x
24576 | 47.12 | 89.07 | 1.89x
27280 | 49.05 | 50.31 | 1.03x

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 21:19:23 +00:00
Bartłomiej Kocot
9c414d2e59 [rocm-libraries] ROCm/rocm-libraries#5454 (commit 8dade31)
[CK][CK Tile] Grouped Convolution backward weight profiler
 flush cache (#5454)

## Motivation

Flush cache to get more stable results during profiling old ck and ck
tile.

## Technical Details

Flush cache before each kernel call and one more first run.

## Test Plan

test_grouped_conv_bwd_weight_tile

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-966
2026-03-16 17:47:07 +00:00
lalala-sh
a3ccd5dca1 [rocm-libraries] ROCm/rocm-libraries#5225 (commit 880166b)
[CK] fix moe memset size which is bigger than alloc

## Motivation
Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that
crashes split-K MOE GEMM with "HIP runtime error: invalid argument".
When KBatch > 1, the invoker zeroes the output buffer using arg.M *
arg.N as the byte count. However, arg.M is the padded sorted-token-id
length from MOE routing, which can be much larger than the actual output
allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write
beyond the buffer, and the silently-swallowed HIP error propagates to
the subsequent kernel launch via hipGetLastError().
This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset
matches the actual output size.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 09:30:57 +00:00
Enrico Degregori
eb033ef208 [rocm-libraries] ROCm/rocm-libraries#4964 (commit 3271d9a)
[CK Tile] Eight Waves pipeline GEMM

## Motivation

Eight waves pipeline was added for ABQuant. The goal of this PR is to
enable it also for GEMM

## Technical Details

Summary:
 - Block:
- Create block struct for GEMM using eight warps specific distribution
encodings
   - Use this block struct in ABQuant for encodings
 - Pipeline:
- Create impl pipeline for eight waves which can be used by GEMM and
ABQuant as base (and for AQuant and BQuant in the future)
- Create eight waves pipeline for GEMM (this can not be easily
integrated in the existing async pipeline)
 - Pipeline policy:
- Extract GEMM specific parts in the ABQuant policy to define GEMM
policy (then ABQuant use it as base and add Quant specific methods)
- Minor: naming was inconsistent between warp/wave, everything is now
referred to as eight waves

So overall we have:
- block struct directly used by GEMM -> ABQuant derived struct to
implement operator
- Impl base pipeline with general implementation -> GEMM and ABQuant
pipelines use it to avoid code duplication but still define their own
pipelines
- pipeline policy struct directly used by GEMM -> ABQuant derived policy
struct for Quant specific parts

## Test Plan

Added new tests for GEMM pipeline:
`test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950
supports it).

Note: K padding test is disabled for this pipeline because it's not
implemented yet

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 08:31:56 +00:00