[CK] Dockerfile: auto-discover latest TheRock nightly tarball
(#6972)
## Motivation
Our docker containers with `--build-arg compiler_version=therock` should
have the latest nightly build of TheRock in `/opt/rocm`. When I looked
for `rocm_kpack` and other `kpack` artifacts, they were missing, and I
realized we had pinned the version by date. Instead, we should look for
the most recent linux-multiarch tarball.
## Summary
- Auto-discover the latest TheRock nightly tarball at Docker build time
instead of pinning a stale URL (previously hardcoded to a Feb 2026
nightly that predates kpack)
- The logic is to `wget` the directory listing and identify the latest
tarball (alphabetically sorted by the YYYYMMDD date in the filename).
- Support manual override via `--build-arg TARBALL_URL=...` for pinning,
and `--build-arg TARBALL_PATTERN=...` for selecting a specific arch
variant
- Fix sccache download URL: `/releases/latest/download/` was redirecting
to v0.15.0 but the filename referenced v0.14.0, causing a 404
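The discovery step can be sketched in a few lines of Python — a minimal sketch under stated assumptions: the listing below is a hypothetical set of filenames scraped from the nightly index page, and `pattern` stands in for the `TARBALL_PATTERN` build-arg (the real Dockerfile does this with `wget` at build time):

```python
import re

# Hypothetical filenames scraped from the nightly index page; the real
# Dockerfile obtains the listing with wget during the build.
listing = [
    "therock-dist-linux-multiarch-7.13.0a20260215.tar.gz",
    "therock-dist-linux-multiarch-7.13.0a20260430.tar.gz",
    "therock-dist-linux-multiarch-7.13.0a20260301.tar.gz",
]

def latest_tarball(names, pattern=r"therock-dist-linux-multiarch-.*\.tar\.gz"):
    # The YYYYMMDD date embedded in the version string sorts correctly as
    # plain text, so a lexicographic max over matching names suffices.
    # `pattern` plays the role of the TARBALL_PATTERN override.
    candidates = [n for n in names if re.fullmatch(pattern, n)]
    return max(candidates)

print(latest_tarball(listing))
# therock-dist-linux-multiarch-7.13.0a20260430.tar.gz
```

Passing a narrower `pattern` selects a specific arch variant, mirroring the `--build-arg TARBALL_PATTERN=...` override.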
## Test plan
- [x] Verified tarball discovery logic resolves to
`therock-dist-linux-multiarch-7.13.0a20260430.tar.gz`
- [x] Built Docker image locally with `--build-arg
compiler_version=therock`
- [x] Confirmed sccache installs successfully with the fixed URL
- [ ] Verify CI pipeline builds with the updated Dockerfile
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
[CK] Restore fmha performance reporting and disable c++17 in
CI (#6741)
## Motivation
This change restores monitoring of FMHA benchmarks performance in daily
builds and removes the std=c++17 flag from CI builds on gfx90a.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Improve the performance of qr_ks_vs_whole_k_prefetch pipeline
(#6209)
## About qr_ks_vs_whole_k_prefetch pipeline
This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to
improve performance on both MI300 and MI350 GPUs through better MFMA
instruction usage, transposed V-loading support, and an N0-loop
implementation. The
pipeline targets scenarios where the number of workgroups is low,
enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs
128) while prefetching entire K tiles.
## Changes:
- Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload)
to avoid using shuffle instructions on MI350
- Implements N0-loop based Gemm0 to reduce tile window movement overhead
and eliminate `clear_tile` calls
- Adds full support for hdim96/hdim160 without padding requirements
- Updates MFMA instruction selection to ensure optimal choices for MI350
## Performance results
1. For attention shapes which lead to kM0=64,
`qr_ks_vs_async_whole_k_prefetch_trload` shows much better performance
than `qr_ks_vs_async_trload` on the same case (execution time `41.02ms`
by whole_k_prefetch_trload & `58.50ms` by async_load), and
`qr_ks_vs_async_whole_k_prefetch_trload` also shows clearly better
performance than the recently tuned `qr_ks_vs_async` on the same case
(execution time `41.02ms` by whole_k_prefetch_trload & `47.60ms` by
qr_ks_vs_async).
2. Also on MI300, for attention shapes which lead to kM0=64,
`qr_ks_vs_async_whole_k_prefetch` shows much better performance than
`qr_ks_vs_async` (which is supposed to be very efficient) on the same
case (execution time `64.50ms` by whole_k_prefetch & `80.20ms` by
qr_ks_vs_async).
3. For attention shapes which lead to kM0=128,
`qr_ks_vs_async_whole_k_prefetch_trload` shows slightly better
performance than `qr_ks_vs_async` on MI350 (execution time `104.50ms` by
whole_k_prefetch_trload & `106.50ms` by qr_ks_vs_async), and they show
completely on-par performance on MI300.
## Test/Verify
1. Use the ROCM xformers branch `test_whole_k_prefetch_n0loop` to
test/verify the qr_ks_vs_whole_k_prefetch pipeline, since this pipeline
cannot be used by the ck_tile fmha example so far.
2. Use the following command-line for building/testing xformers
```bash
git clone -b test_whole_k_prefetch_n0loop https://github.com/ROCm/xformers
git submodule update --init --recursive
pip install --no-build-isolation -e ./
pytest tests/test_mem_eff_attention.py::test_forward
```
3. Any script that can run on xformers can be used to evaluate the
qr_ks_vs_whole_k_prefetch pipeline. Use the following two environment
variables to switch between pipelines:
```bash
export FMHA_DISABLE_SPECIAL_TREATMENT=1  # disable the FAV3 and qr_ks_vs_async_trload pipelines
export FMHA_ENABLE_ASYNC_PIPELINE=1      # disable the qr_ks_vs_async pipeline for comparison
```
## Discussion
[CK] Fix CI Failures for PRs From Forks
## Motivation
Fork PRs fail CI when `RUN_AITER_TESTS` or `RUN_FA_TESTS` is enabled.
The docker scripts run `git clone -b "$CK_*_BRANCH"
https://github.com/ROCm/rocm-libraries.git`, but a fork's branch doesn't
exist upstream:
```
fatal: Remote branch <fork-branch> not found in upstream origin
```
Example: [PR #6529 build
#4](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-6529/4/pipeline).
## Technical Details
**`Jenkinsfile`** — for PRs, use the upstream-visible PR ref instead of
the head branch name:
```groovy
CURRENT_BRANCH_NAME = env.CHANGE_ID
? "refs/pull/${env.CHANGE_ID}/head"
: (env.CHANGE_BRANCH ? env.CHANGE_BRANCH : env.BRANCH_NAME)
```
**`Dockerfile.aiter` / `Dockerfile.fa`** — `git clone -b <ref>` only
accepts branches (`refs/heads/*`) and tags (`refs/tags/*`), so it can't
resolve `refs/pull/N/head`. Switch to `git fetch`, which accepts any
refspec (and still works for plain branch names):
```sh
mkdir rocm-libraries && cd rocm-libraries
git init -q
git remote add origin https://github.com/ROCm/rocm-libraries.git
git fetch --depth 1 --filter=blob:none origin "$CK_*_BRANCH"
git sparse-checkout init --cone
git sparse-checkout set projects/composablekernel
git checkout FETCH_HEAD
```
`git checkout FETCH_HEAD` lands in detached HEAD, which breaks the
existing `git branch -m "$CK_*_BRANCH"` (and that name isn't a valid
local branch anyway). Decouple the local branch name from the upstream
ref:
- Replace `git init` + `git branch -m` with `git init -b
"$LOCAL_BRANCH"` (requires git ≥ 2.28, satisfied by base images)
- `LOCAL_BRANCH="ck-import-${ROCM_LIBRARIES_SHA}"` in the rocm-libraries
path; `LOCAL_BRANCH="$CK_*_BRANCH"` in the fallback
- Downstream `git clone -b ... ../ck` uses `$LOCAL_BRANCH`
## Test Plan
Manually trigger a build on this PR with `RUN_AITER_TESTS=true` and
`RUN_FA_TESTS=true`; both docker images should build end-to-end.
## Test Result
[jenkins / rocm-libraries-folder/Composable Kernel / PR-6701 /
#3](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-6701/3/pipeline/)
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] fix(fmha): support >2GB KV cache in batch prefill
via template dispatch (#6653)
## Motivation
The CK batch prefill kernel previously failed (silent overflow + page
faults) when the KV cache exceeded 2 GB, blocking long-context inference
workloads (e.g., 128K+ token contexts with paged KV).
Two distinct failure modes were addressed:
1. **>4GB SRD overflow (`page_size < kN0`):** The SRD
`buffer_load_dwordx4` path uses a 32-bit `voffset` register; for small
page sizes the rebased SRD spans the full KV pool and the offset wraps
past 2 GB, corrupting K/V loads.
2. **gfx950 page-table fault (`page_size >= kN0`):** On CDNA4 the
hardware validates the **full SRD `num_records` range** against
page-table permissions (CDNA3 only checks per-instruction `voffset`).
After per-tile SRD rebase, an un-trimmed `num_records` field extends
past the live page and faults on freed/protected memory.
## Technical Details
**Two-mode `tile_scatter_gather` selected by the `kUseGlobalLoad`
template parameter:**
| Case | `page_size` | KV cache size | Mode | Load path | Addressing |
|---|---|---|---|---|---|
| 1 | `>= kN0` (large pages) | any | SRD (`kUseGlobalLoad=false`) | `buffer_load_dwordx4` | 32-bit `voffset`, bounded by per-page rebase |
| 2 | `< kN0` (small pages) | `<= 2 GB` | SRD (`kUseGlobalLoad=false`) | `buffer_load_dwordx4` | 32-bit `voffset`, fits in INT32 byte range |
| 3 | `< kN0` (small pages) | `> 2 GB` | Global-load (`kUseGlobalLoad=true`) | `async_load_tile_raw_flat` (K) + `load_tile_flat` (V) | 64-bit |
**Dispatch:** the auto-gen API layer (`fmha_batch_prefill.py`) selects
the kernel instantiation at launch from `(page_block_size,
num_total_pages * batch_stride_k * kElementBytes)`, so the small-page
penalty is paid only when correctness requires it.
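As a sketch of the dispatch rule (illustrative parameter names, not the generated API's exact signature; `kN0` and the 2 GB limit defaults are assumptions drawn from the table above):

```python
def use_global_load(page_block_size, num_total_pages, batch_stride_k,
                    elem_bytes, kN0=128, srd_limit=2 * 1024**3):
    # Fall back to the 64-bit global-load kernel only when pages are
    # smaller than the kN0 tile AND the KV pool exceeds the 2 GB range
    # addressable through the 32-bit SRD voffset; otherwise stay on the
    # faster buffer_load SRD path.
    kv_pool_bytes = num_total_pages * batch_stride_k * elem_bytes
    return page_block_size < kN0 and kv_pool_bytes > srd_limit
```

This mirrors the three cases of the mode table: large pages never need the global-load path, and small pages need it only once the pool crosses 2 GB.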
**gfx950 SRD `num_records` trimming:** in the K and V rebase lambdas of
`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`,
`set_bottom_tensor_view_buffer_size(page_stride_k/v)` is called after
each rebase to constrain `num_records` to the live page. Required for
CDNA4 page-table validation; harmless on CDNA3.
**Pipeline sync for the global-load path:**
- V uses synchronous `load_tile_flat`; K uses
`async_load_tile_raw_flat`.
- `v_physical_pages_current` is double-buffered so the V flat load
doesn't race against the next iteration's K rebase computation.
**Arch guards:** `global_load_lds` intrinsics are gated to `__gfx94__` /
`__gfx950__` (CDNA3+). Other architectures hit a `dependent_false`
static_assert with a descriptive message.
**Device-side assertion convention:** SRD setters use
`__builtin_assume(cond)` (hint-only) rather than `<cassert>`'s
`assert()`. The latter introduces an `__assert_fail` call whose register
pressure scatters the K-SRD scalar register window across conditional
branches, corrupting `buffer_load_dwordx4` on gfx950.
## Test Plan
Tested on both MI308 (gfx942) and MI355 (gfx950) via the aiter wrapper
test suite. All coverage lives in **`op_tests/test_batch_prefill.py`**:
- **Functional matrix (96 cases)** — `test_batch_prefill`: `page_size ∈
{1, 16, 1024}` × `kv_layout ∈ {linear, vectorized}` × `dtype ∈ {bf16,
fp8 quant variants}` × `causal` × `soft_cap` × `LSE` × `batch_size ∈ {1,
4}` (parametrized to exercise per-sequence SRD rebase across batch
boundaries).
- **>2 GB coverage** — `test_batch_prefill_large_kvcache`: extended to
allocate a 5 GB+ KV cache pool and exercise both `kUseGlobalLoad=true`
(small-page) and `kUseGlobalLoad=false` (large-page rebase) paths.
Includes both single-batch and multi-batch (`batch_size=4`) cases to
exercise per-sequence SRD rebase across the >2 GB pool.
- Numerical reference: PyTorch SDPA, per-batch loop with `atol` / `rtol`
from the existing batch prefill test harness.
## Test Result
| Arch | `test_batch_prefill` | `test_batch_prefill_large_kvcache` (>2 GB) |
|------|----------------------|---------------------|
| MI308 (gfx942) | All passed | Passed |
| MI355 (gfx950) | All passed | Passed |
**Performance impact (gfx950, hot SRD path):**
- +2.67% kernel-time on `seqlen=1024 / page_sz=1024 / bf16 / sglang /
causal / soft_cap=30`, attributable in full to the two
`set_bottom_tensor_view_buffer_size` calls in the K/V rebase lambdas
(5-run median, signal/noise ≈ 9×).
- This cost is **mandatory for gfx950 correctness** on >2 GB workloads —
removing the setters re-introduces page-faults.
- gfx942: 0 regressions in the same range (all configs ≤ +0.97%).
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Fix out of bounds modifications caused by negative
topk_ids in MoeSortingMultiPhaseKernel_P0_v1 (#6242)
## Motivation
Fix random sglang crashes by filtering out negative topk ids.
## Technical Details
In sglang expert-parallel mode, an idle batch (batch=0) may be fired; in
CUDA graph mode it reuses the batch=1 resources. The topk op sets unused
topk ids to -1, so in the idle-batch case all topk ids are -1. In
`MoeSortingMultiPhaseKernel_P0_v1`, a negative expert id causes
out-of-bounds overwrites, and sglang may randomly crash.
Besides the idle-batch case, if the captured batch sizes are discrete,
-1 expert ids may appear due to similar logic.
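A host-side sketch of the guard (plain Python, not the actual HIP kernel; names are illustrative) shows why the -1 sentinel must be filtered before it is used as an index:

```python
def count_tokens_per_expert(topk_ids, num_experts):
    # Histogram tokens per expert, skipping the -1 sentinel the topk op
    # writes into unused slots. Without the guard, -1 would index the
    # last bucket here (or, in the real kernel, write out of bounds).
    counts = [0] * num_experts
    for e in topk_ids:
        if 0 <= e < num_experts:  # filter negative topk ids
            counts[e] += 1
    return counts

# Idle batch replayed in CUDA-graph mode: every topk id is -1,
# so nothing may be written anywhere.
assert count_tokens_per_expert([-1, -1, -1, -1], 4) == [0, 0, 0, 0]
```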
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: zovonoir <jialzhu@amd.com>
Fix per-layer conv2d int8 CPU verification reference path
(#6656)
Case: `example_conv2d_fwd_xdl_perlayer_quantization_int8.exe 1 0`
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Add missing lds sync
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Grouped Convolution Backward Data Direct Load
(#6624)
## Proposed changes
Add Grouped Convolution Backward Data with Direct Load into the
DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffleV3 device implementation.
This enables direct global memory loading (bypassing LDS) for the
backward data convolution path on gfx950, following the same pattern
used in both backward weight and forward convolution.
Direct-load convolution backward data improves performance by avoiding
LDS round-trips for certain configurations on gfx950, which supports a
wider range of instructions. Currently, correctness is checked only at
the usage point; this should be extended to a standalone unit test in
the future.
Fix ck4inductor conv instance parsing for NumGroupsToMerge
parameter (#6434)
## Summary
- Add `num_groups_to_merge` field to `CKGroupedConvFwdOp` dataclass to
match the new (#4273) `NumGroupsToMerge` template parameter added to
`DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3`
- Enable inductor tests by default in Jenkins CI
## Test plan
- [x] Built wheel without patch: `test_gen_conv_instances` fails with
`TypeError: takes from 47 to 50 positional arguments but 51 were given`
- [x] Built wheel with patch: `test_gen_conv_instances` passes
[CK_TILE] Preserve input strides in EightWaves async-load
descriptor (#6611)
`MakeAsyncLoadADramWindow` in
`GemmPipelineAgBgCrCompAsyncEightWavesPolicy` was rebuilding the 6D view
descriptor with `make_naive_tensor_descriptor_packed`, which synthesizes
strides from lengths and assumes a dense layout. When the input view's
leading-dim stride is larger than its inner length (non-packed memory
layout), the resulting tile window stepped through memory at the wrong
stride.
Compose the unmerge transforms on top of the input view's existing
descriptor instead, so the actual runtime strides are preserved and the
correct `element_space_size` is inherited for bounds checking.
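The failure mode can be illustrated with a small Python sketch (hypothetical names; the real code operates on CK tensor descriptors): synthesizing packed strides from lengths alone computes wrong addresses as soon as the leading-dim stride exceeds the inner length.

```python
def packed_strides(lengths):
    # Analogue of make_naive_tensor_descriptor_packed: synthesize dense
    # row-major strides from lengths alone, assuming a packed layout.
    strides, acc = [], 1
    for length in reversed(lengths):
        strides.append(acc)
        acc *= length
    return list(reversed(strides))

# A non-packed 2D view: 4 rows of 8 elements, but rows actually start
# 10 elements apart in memory (leading-dim stride 10 > inner length 8).
lengths, true_strides = [4, 8], [10, 1]
packed = packed_strides(lengths)  # synthesizes [8, 1] -- wrong for this view
elem = (2, 3)
addr_true = elem[0] * true_strides[0] + elem[1] * true_strides[1]  # 23
addr_packed = elem[0] * packed[0] + elem[1] * packed[1]            # 19
assert addr_true != addr_packed  # packed descriptor steps at the wrong stride
```

Composing the unmerge transforms on the input view's existing descriptor keeps `true_strides` (and the real `element_space_size`) in play instead.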
## Test Plan
Added a unit test demonstrating the problem.
## Test Result
The new test fails before the fix and passes after.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Add conv bwd data tests
## Motivation
This PR adds tests for CK Tile's convolution backward data operation to
enable functionality regression tracking and error-detection.
## Technical Details
Currently only the NHWGC/GKCYX/NHWGK and NDHWGC/GKCZYX/NDHWGK (2D and 3D
channel-last) layouts are tested, since only they are implemented in CK
Tile. The current tests cover FP16, BF16, and FP32 data types and
various convolution scenarios. The tested instances are listed in the
`experimental/grouped_convolution_tile_instances` directory.
## Test Result
All implemented tests are working properly and passing.
[CK] Fix for hipblaslt error in PyTorch Dockerfile
## Motivation
This PR fixes the hipblaslt client build failures that occur when
building the PyTorch Docker image, which are currently causing failures
in CI.
## Technical Details
- Correctly reset the working directory to tmp/
- Added --use-system-packages to install.sh to use the system-installed
LAPACK packages, as the hard-coded paths were not being built.
## Test Plan
Locally built the Docker image using the Dockerfile.
## Test Result
Image was successfully built.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[CK][fmha] Add StreamLLM sink support to batch_prefill
pipeline (#6479)
## Motivation
The existing paged-KV attention pipelines (pagedkv, splitkv) support
StreamLLM-style sink tokens — a fixed set of initial tokens kept in
attention alongside the sliding window. The `batch_prefill` pipeline
(chunked-prefill with VLLM-style block tables) previously hardcoded
`kHasSink = false`, making it incompatible with sink-based attention
patterns in LLM serving scenarios.
This PR extends `batch_prefill` to support `kHasSink` and wires it
into `fmha_fwd_runner` for validation against the existing CPU
reference.
## Technical Details
**Pipeline** (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`):
- When `kHasSink`, the K/V loop splits into a sink phase
[0, sink_seq_end) and a window phase [seqlen_k_start, seqlen_k_end),
mirroring pagedkv.
- The K advance at the sink→window transition jumps by
`seqlen_k_start - sink_seq_end + kN0` to bridge the gap.
- V scatter-gather offsets are re-initialized at the transition to fix a
window-mismatch bug: V was lagging kN0 behind K after the large jump,
loading from the wrong sequence position.
- Bias window, dropout seq_offset, and mask type (LogitsSinkMask) are
updated for sink-awareness.
**Traits / codegen** (`tile_fmha_traits.hpp`, `fmha_fwd.hpp`,
`fmha_batch_prefill.py`):
- `TileFmhaBatchPrefillTraits` gains `kHasSink_` (was hardcoded
`false`).
- Codegen adds `F_sink` field; skips batch-mode kernels (group mode
required).
- CMake test filter broadened from 9 → 33 instances covering
fp16/bf16 × mask/nmask × lse/nlse × sink/nsink.
**Runner** (`fmha_fwd_runner.hpp`, `CMakeLists.txt`):
- `fmha_batch_prefill()` dispatched from `run_fwd` when:
group mode + paged KV + num_splits == 1.
- K/V strides corrected for runner's [num_pages, nhead_k,
page_block_size, hdim] layout.
- `page_block_size % 128` check relaxed: batch_prefill supports ps=16.
- CPU reference paged-KV reordering guards extended with
`CK_TILE_FMHA_FWD_BATCH_PREFILL_API`.
## Test Plan
Build with `-DFMHA_FWD_ENABLE_APIS="fwd;batch_prefill"`, run
`tile_example_fmha_fwd` in group mode with page_block_size=16.
Test matrix:
- Mask: no-mask, causal, sliding window
- Sink: nsink, sink=1..128
- dtype: fp16, bf16
- LSE output: on/off
- seqlen ∈ {512,1024,2048,4096} × window ∈ {32,256,512,1024}
- GQA, chunked prefill, large batch×seqlen
- page_block_size: 16, 32
## Test Result
171 test cases, all passing:
- nmask + nsink: ✓
- causal + nsink: ✓
- causal + sink=8: ✓
- sliding window + sink=8 (d=128, d=256): ✓
- bf16, LSE output, GQA: ✓
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
projects/composablekernel: add SwigluStep support for MoE
blockscale (#6118)
## Summary
- add `swiglustep_and_mul` to the composablekernel MoE blockscale
activation enum
- implement the corresponding blockscale epilogue path for `SwigluStep`
- keep existing `silu` and `gelu` paths unchanged
## Scope
This PR covers the classic composablekernel blockscale MoE path under
`projects/composablekernel`.
This is separate from the `ck_tile` / FlatMM path being discussed in
ROCm/rocm-libraries#5992.
## Motivation
`Step-3.5-Flash-FP8` uses `SwigluStep` in its MoE MLP path. The
dependent AITER change needs native support for this activation in the
classic composablekernel MoE blockscale path.
## Validation
- patch is limited to two composablekernel files under
`projects/composablekernel`
- existing `silu` / `gelu` paths are unchanged
- dependent AITER runtime validation hit the classic CK 2-stage path
with AITER MoE enabled
[CK] Remove code duplications in grouped gemm fixed nk
implementations (#4961)
## Motivation
Different flavours of grouped gemm fixed nk implementations share the
same block-to-tile mapping logic. Despite that, the code responsible for
it is duplicated in each device struct implementation.
- Move `BlockToCTileMap_KBatch_M00_N0_M01Adapt_MLoops` and
`OffsettedBlockToCTileMapMLoops` from the device struct implementations
to a common header file.
- Use the generic Kernel Argument structures in xdl versions of the
fixed nk.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
CI in general. Relevant tests and examples are all the fixed_nk versions
of grouped gemm multiple D and ABD.
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[MIOpen][CK] Fix bwd weight conv test failures by disabling
one block-GEMM V5 instance for 3D convs (#6421)
## Motivation
Due to compiler version update, there are test failures in the test
target `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current develop branch, the test
failures are blocking any PR merges into develop. An example of a failed
CI run is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as the tests are passing for clang compiler version 20.0 and
failing for clang compiler version 22.0.
First attempt to fix this problem had to be reverted in #6400 because it
broke MIOpen internal DB sync tests.
## Technical Details
The root cause of the test failures is the block-GEMM V5 instances of
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that have a large tile size.
The V5 pipeline uses a double register buffer, which in combination with
a large tile size causes high register pressure. The latest compiler
version handles the register spillage incorrectly for `gfx90a`, which
causes the kernel to output incorrect results.
The BF16/FP16 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3`
that do not use direct load are divided into two groups:
- Base instances
- Instances that result in high register usage (currently only the one
instance that causes the test failures).
This division makes it possible to disable only the V5 block-GEMM flavor of
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>` for 3D convolutions on `gfx90a`. The
selective disabling leaves the set of instances for 1D and 2D
convolutions unaffected, and removes at runtime two V5 block-GEMM
instances (`ConvBwdWeightDefault` and
`ConvBwdWeightFilter1x1Stride1Pad0`) per data type (FP16/BF16) when the
device is `gfx90a`.
Because MIOpen uses CK's type string (provided by method
`GetTypeString`) to identify the instances, the DB sync tests are
expected to be unaffected, since there are still V2 block-GEMM instances
that result in the same type string
(`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>`). This expectation needs to be verified by
running the MIOpen DB sync tests that are not part of the normal CK PR
build.
## Test Plan
Running all CI tests + the MIOpen internal DB sync tests is sufficient
to verify the correctness of the code changes.
## Test Result
Verified locally that the previously failing tests
`TestGroupedConvndBwdWeight3d/4.Test3D` and
`TestGroupedConvndBwdWeight3d/4.Test3D` have instance counts
- 231 on `gfx90a`
- 233 on `gfx942`
and are currently passing. This confirms the expectation that two
instances per data type should be disabled on `gfx90a`.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Ville Pietilä <>
[CK][CK_TILE] Fix library caching bug in gemm dispatcher
(#6445)
## Motivation
`setup_gemm_dispatcher()` was rebuilding libraries on every call instead
of reusing cached libraries.
**Root Cause**:
1. Library names only included dtype+layout, causing different
tile/wave/warp configs to overwrite each other
2. No cache checking - always loaded default library, detected mismatch,
then rebuilt
## Technical Details
**Solution**:
1. Complete library naming with all distinguishing parameters:
`libdispatcher_gemm_{dtype}_{layout}_{tile}_{wave}_{warp}_{pipeline}_{epilogue}_{scheduler}.so`
2. Cache checking before rebuild:
- Check if library for exact config already exists
- Reuse if found (500x faster: 0.02s vs 10s)
- Only rebuild when no cached library exists
3. Better error handling for kernel generation failures
Files Changed:
- dispatcher/python/ctypes_utils.py
- dispatcher/tests/test_library_caching.py (new unit test)
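The cache check can be sketched as follows (illustrative names and field set; not the dispatcher's exact API):

```python
import tempfile
from pathlib import Path

CONFIG_KEYS = ("dtype", "layout", "tile", "wave", "warp",
               "pipeline", "epilogue", "scheduler")

def library_name(cfg):
    # Include every distinguishing parameter in the name so different
    # configs can no longer overwrite each other's libraries.
    return ("libdispatcher_gemm_"
            + "_".join(str(cfg[k]) for k in CONFIG_KEYS) + ".so")

def get_or_build(cfg, build_dir, build_fn, log):
    path = Path(build_dir) / library_name(cfg)
    if path.exists():
        log.append("cache hit")   # reuse: ~0.02 s instead of a ~10 s rebuild
        return path
    log.append("building")        # miss: build once, then persist
    build_fn(path)
    return path

# Two calls with the same config: build once, then hit the cache.
cfg = dict(dtype="fp16", layout="rcr", tile="256x256x64", wave="2x2",
           warp="32x32", pipeline="compv3", epilogue="cshuffle",
           scheduler="intrawave")
log = []
with tempfile.TemporaryDirectory() as d:
    get_or_build(cfg, d, lambda p: p.write_bytes(b""), log)
    get_or_build(cfg, d, lambda p: p.write_bytes(b""), log)
assert log == ["building", "cache hit"]
```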
## Test Plan
Use `dispatcher/tests/test_library_caching.py` to ensure that libraries
are cached and only rebuilt if they are not present in the build
directory:
1. **test_01_unique_library_naming** - Library names include all
parameters (dtype, layout, tile, wave, warp, pipeline, epilogue,
scheduler)
2. **test_02_library_build_and_cache** - Libraries are built once and
then cached for reuse
3. **test_03_different_configs_different_libraries** - Different configs
create different library files
4. **test_04_cache_message_verification** - Cache hit messages are
logged correctly
5. **test_05_code_fix_verification** - Code changes are present in
ctypes_utils.py
## Test Result
All the tests above passed.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Remove obsolete benchmark_fwd_v3.sh script and README
reference (#6305)
The tile_example_fmha_fwd_v3 target no longer exists in this project,
making this benchmark script non-functional.
[CK_TILE] Separate PermuteN epilogue from CShuffle epilogue
into standalone file (#5863)
## Motivation
The PermuteN epilogue was previously embedded within
cshuffle_epilogue.hpp, despite having fundamentally different behaviour.
Coupling these two independent strategies in one file introduced
unnecessary complexity, SFINAE guards, and a dual operator() overload
selected at compile time via the TiledMMAPermuteN_ template parameter.
This PR separates PermuteN into its own standalone file
(permuten_epilogue.hpp), simplifying both implementations and making the
codebase easier to maintain and extend independently.
## Technical Details
**New file: permuten_epilogue.hpp:**
contains PermuteNEpilogueProblem and PermuteNEpilogue, extracted from
the permuteN code path in cshuffle_epilogue.hpp.
**Cleanup of cshuffle_epilogue.hpp:**
- Removed the TiledMMAPermuteN_ template parameter from
`CShuffleEpilogueProblem`
- Removed the SFINAE-guarded permuteN operator() overload
- Removed the EnablePermuateN_ SFINAE alias
- CShuffle now only contains CShuffle logic; EightWave support
(independent feature) is retained
**Consumer migration:**
All consumer files now use compile-time epilogue selection via
`std::conditional_t`:
```cpp
using GemmEpilogue = std::conditional_t<
    TiledMMAPermuteN,
    PermuteNEpilogue<PermuteNEpilogueProblem<...>>,
    CShuffleEpilogue<CShuffleEpilogueProblem<...>>>;
```
**Files modified:**
- flatmm_basic.cpp, moe_flatmm.cpp, a16w4_moe_flatmm.cpp,
mixed_prec_flatmm.cpp, mx_flatmm_instance.hpp — flatmm examples
- run_gemm_quant_example.inc — block-scale GEMM example
- gemm_weight_preshuffle_invoker.hpp — weight preshuffle invoker
- test_gemm_quant_fixtures.hpp, test_gemm_persistent_async_input.cpp,
test_gemm_pipeline_util.hpp — test utilities
- universal_gemm_invoker.hpp — universal GEMM invoker
- epilogue.hpp — add header updated to include permuten_epilogue.hpp
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Restructure Tile Engine's benchmarking and
profiling (#4769)
## Motivation
This PR introduces a restructure for the benchmarking and profiling
aspects of CK Tile's Tile Engine, expanding on the groundwork from this
previous https://github.com/ROCm/composable_kernel/pull/3434 and
outlined in this [design
document](https://amdcloud-my.sharepoint.com/:w:/r/personal/astharai_amd_com/Documents/Restructuring%20Tile%20Engine.docx?d=w14ea28a30718416988ed5ebb759bd3b2&csf=1&web=1&e=l3VBuX).
In PR 3434, to reduce repeated code we implemented:
- Base class that centralizes common functionality and provides a
default implementation (Universal GEMM)
- Child classes for GEMM variants override virtual functions to handle
variant-specific behavior
The refactoring in this PR follows the same process and should greatly
reduce the duplicated code present in Tile Engine and make it simpler to
add in new operations, increasing scalability.
## Technical Details
The files have been refactored around new base structs for benchmarks,
profiling and problem descriptions. The new base structs are:
- GemmProblem
- GemmBenchmark
- GemmProfiler
Universal GEMM, Preshuffle GEMM, and Multi-D GEMM all have child classes
that will inherit from these base structs overriding only what differs
per variant.
All common functions across the benchmarking and profiling files have
been moved into newly added common utility files under the commons/
directory. The new utility files are:
- utils.hpp: common functions for the benchmarking and profiling process
- benchmark_utils.py: common utility functions for the benchmark
generation
## Test Plan
I tested using the existing tests for Tile Engine.
## Test Result
All tests passed.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[ck] Clamp negative kernel execution elapsed time to zero
(#6379)
## Motivation
`hipEventElapsedTime` can return a small negative value on Windows when
timing a very fast kernel launch on the null stream. This caused
consumers of `launch_and_time_kernel` to receive a negative elapsed
time, which they reasonably treat as an error, breaking
otherwise-correct kernel executions.
## Technical Details
After calling hipEventElapsedTime, a clamp is applied in
launch_and_time_kernel before the result is returned, avoiding the
return of a physically impossible elapsed time.
The negative value from hipEventElapsedTime has been observed on
Windows. For kernels that complete in well under a millisecond, the HIP
event timestamps can alias such that the computed difference is a small
negative number (observed: ~-1.78 ms). No HIP error is reported by any
surrounding call (hipEventRecord, hipEventSynchronize, hipGetLastError),
confirming the kernel itself executed successfully.
## Test Plan
- Recompile CK and validate no kernel execution reports a negative
elapsed time during hipTensor tests.
- Pass the CI/CD pre-checking tests for CK.
## Test Result
- All tests passing
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Skip fp16 dropout d256 batch tests for compiler VGPR
aliasing bug (#6342)
## Summary
- Skip fp16 FMHA forward dropout tests that use the d256 tile in batch
mode, gated on compiler version
- The AMDGPU compiler miscompiles these kernels due to VGPR aliasing of
Philox RNG parameters under high register pressure (383 VGPRs)
- bf16 dropout tests are unaffected and cover the same code paths
## Root Cause
The compiler aliases `ph_seed` and `ph_head_offset` (Philox RNG state
stored in VGPRs) with other live data during the softmax main loop. This
causes corrupted `buffer_store_byte` writes for dropout randval on wave
lanes 32-63, producing NaN in output and LSE tensors.
**Conditions:** fp16 + d256 tile + dropout + batch mode + `qr` pipeline
+ gfx90a
## Changes
- `include/ck_tile/core/config.hpp`: Add
`CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE` macro
- `test/ck_tile/fmha/test_fmha_fwd.cpp`: Version-gated `GTEST_SKIP` in
`TEST_P(Dropout, ...)`
## Test plan
- [x] ROCm 7.1.1 (clang 20): 168/168 fp16 dropout tests PASS (no skip
active)
- [x] ROCm 7.12 (clang 22): 132 PASS, 36 SKIPPED, 0 FAILED
- [x] bf16 dropout tests: 168/168 PASS (unaffected by this change)
[CK][CK TILE] Modify elementwise kernel template signature to
accept independent type arguments (#6399)
## Motivation
Modify the elementwise kernel template signature to fix a CShuffle
epilogue build error.
## Technical Details
Encountered a build error while building the conv fallback kernel with
the dispatcher.
Error: Type mismatch in `ElementWiseKernel::operator()` where the
template required all three parameters (lens, input_strides,
output_strides) to be the same type, but the CShuffle epilogue was
passing them with different tuple element types.
Solution: Modified the template signature in elementwise_kernel.hpp to
accept three independent type parameters:
- Changed from a single `typename Dims` to `typename DimsLens`,
`typename DimsInStrides`, and `typename DimsOutStrides`
- Updated references to `Dims::size()` to use the appropriate specific
type
## Test Plan
- Test with dispatcher conv unit tests
- Relying on CI tests
## Test Result
- Dispatcher unit tests passed
- Relying on CI tests
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[MIOPEN] [CK] Revert "[CK] Disable test cases affected by
compiler codegen bugs on gfx90a" (#6400)
Reverts ROCm/rocm-libraries#6343
This is causing failures in MIOpen (namely Dbsync on gfx942), even though it shouldn't be affected, so this needs to be investigated. Please add miopen as a label to the new PR addressing the compiler codegen bug so that both can be handled simultaneously.
[CK] Disable compilation of problematic bwd weight conv
instances for gfx90a (#6343)
## Motivation
Due to compiler version update, there are test failures in the test
suite `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current `develop` branch, the test
failures are blocking any PR merges into `develop`. An example of a
failed CI run is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as tests are passing for clang compiler version 20.0 and failing
for clang compiler version 22.0.
## Technical Details
This PR disables the compilation of the problematic bwd weight conv
instance for `gfx90a` by adding a new CMake flag `CK_USE_GFX90A` that
allows us to detect when we are compiling for `gfx90a`. Using the new
CMake flag, compilation of instance
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>` is disabled for `gfx90a`.
Co-authored-by: Ville Pietilä <>
CK: Extract shared boilerplate from 47 gemm_quant test files
(#6323)
Depends on #6303
## Summary
Extract shared test boilerplate (includes, type aliases, test fixture
macros) from 47 `test_gemm_quant_*` files into a single
`test_gemm_quant_common.hpp` header. Each test file is reduced from ~50
lines of boilerplate to ~5 lines.
| Metric | Value |
|--------|-------|
| Files changed | 48 |
| Insertions | +413 |
| Deletions | −1,106 |
| **Net lines removed** | **−693** |
### What changed
| Before | After |
|--------|-------|
| 47 test files, each with ~50 lines of identical includes, type
aliases, and fixture macros | 1 shared header
(`test_gemm_quant_common.hpp`) + 47 thin files (~5 lines each: include +
params) |
### Readability assessment
A code realist review confirmed this change **improves readability**:
the 47 test files had identical boilerplate obscuring the only
meaningful content — the `GemmConfig` type alias and test dimensions.
After the refactoring, each file's unique configuration is immediately
visible, and adding a new test variant requires specifying only the
varying parameters instead of copying 50 lines.
### Cumulative cleanup series stats
| PR | Description | Net lines |
|----|-------------|-----------|
| #6300 | Remove 61 dead `#if 0` blocks | −2,648 |
| #6302 | Remove 41 commented-out dead code blocks | −2,861 |
| #6303 | Remove 4 orphaned files | −3,886 |
| This PR | Extract gemm_quant test boilerplate | −693 |
| **Total** | | **−10,088** |
CK: Remove 4 orphaned files with verified replacements
(~1,025 lines) (#6303)
Depends on #6302
## Summary
Remove 4 orphaned files that have verified replacements already in the
build.
| File | Reason | Replacement |
|------|--------|-------------|
| `test_gemm_pipeline_compiler.cpp` | Refactored into 13 smaller tests |
`_compv3`, `_compv4`, `_mem`, `_persistent`, etc. |
| `test_grouped_gemm_quant.cpp` | Refactored into 5 smaller tests |
`_rowcol`, `_tensor`, `_aquant`, `_bquant`, etc. |
| `..._f8_f8_f16_..._comp_default_instance.cpp` | Superseded by split
files | `_part1.cpp` + `_part2.cpp` |
| `..._f8_f8_f16_..._comp_kpadding_instance.cpp` | Superseded by split
files | `_part1.cpp` + `_part2.cpp` |
Each deletion was verified:
- Original file is NOT in any CMakeLists.txt
- Replacement files ARE in CMakeLists.txt and actively compiled
- Content is fully covered by the replacement files
[CK] Add flash_attn tests
## Motivation
Add CI support for running
[flash-attention](https://github.com/ROCm/flash-attention) tests against
CK, similar to existing AITER and PyTorch downstream test pipelines.
## Technical Details
### New: `Dockerfile.fa`
A new Dockerfile that builds a flash-attention test image on top of a
ROCm PyTorch base image. It:
- Sparse-checkouts CK from `rocm-libraries` (or clones directly from
`ROCm/composable_kernel`)
- Clones and builds `flash-attention` with CK as the backend
- Supports configurable `FA_BRANCH`, `CK_FA_BRANCH`, and `GPU_ARCHS`
build args
### Updated: `Jenkinsfile`
**buildDocker refactor:**
- Extracted `buildAndPushDockerImage()` helper that handles both "check
if exists, skip" and "force build, push" logic, eliminating the
duplicated try/catch blocks
- Split monolithic `buildDocker()` into `buildDockerBase()`,
`buildDockerPytorch()`, `buildDockerAiter()`, and new `buildDockerFa()`
- Each downstream docker build now runs unconditionally within its
respective guard (`RUN_PYTORCH_TESTS`, `RUN_AITER_TESTS`,
`RUN_FA_TESTS`)
- Image digests are stored in env vars (`CK_BASE_IMAGE`,
`CK_PYTORCH_IMAGE`, `CK_AITER_IMAGE`, `CK_FA_IMAGE`) for use in
downstream stages
**run_downstream_tests refactor:**
- Merged `run_aiter_tests()` and `run_pytorch_tests()` into a single
generic `run_downstream_tests(conf)` that accepts `image`,
`timeoutHours`, and `execute_cmds`
- Test commands for each downstream target are declared as top-level
lists (`RUN_PYTORCH_TESTS_CMDS`, `RUN_AITER_TESTS_CMDS`,
`RUN_FA_TESTS_CMDS`)
**Pipeline stages:**
- Merged "Run Pytorch Tests" and "Run AITER Tests" into a single "Run
Downstream Tests" parallel stage
- Added two new FA test stages: "Run FA Tests on gfx942" and "Run FA
Tests on gfx950"
- Added new pipeline parameters: `RUN_FA_TESTS`, `fa_base_docker`,
`fa_branch`, `ck_fa_branch`
- `ck_pytorch_branch` and `ck_aiter_branch` now default to the current
branch instead of hardcoded `develop`
- CRON schedule at 13:00 now also triggers `RUN_FA_TESTS=true`
## Test Plan
- [x] Trigger pipeline manually with `RUN_FA_TESTS=true` on gfx942 and
gfx950 nodes
- [x] Verify existing AITER and PyTorch test stages are unaffected
- [x] Verify `buildAndPushDockerImage` correctly skips rebuild when
image already exists (with `BUILD_DOCKER=false`)
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Correct .readthedocs.yml file path
## Motivation
The Read the Docs config files contain outdated file paths from their
legacy repos. Update and correct all paths.
## Technical Details
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] [CK_Tile] Add GroupConv to Kernel Dispatcher
## Motivation
This PR adds CK Tile group convolution (forward, backward-data,
backward-weight) support to the kernel dispatcher, matching and unifying
with the existing dispatcher GEMM infrastructure in architecture and
usability. The dispatcher provides a unified kernel dispatch system with
both C++ and Python frontends, and until now only supported GEMM
operations. This PR enables framework integrators to use the same
declarative kernel workflow for convolutions as they do for GEMM:
declare kernels, build a registry JIT, select kernels within the
registry at runtime, and dispatch to GPU. Future PRs will include
runtime kernel selection heuristics for autotuning of kernel parameters
based on (problem, hardware arch).
## Technical Details
Grouped convolution support has been added to the CK Tile Dispatcher:
- generated_conv_backend.hpp enables dispatcher.run(in, wei, out,
problem) for all 6 conv variants (fwd/bwdd/bwdw x 2D/3D), with runtime
heuristic kernel selection and a GroupedConvKernelKey carrying the full
ConvConfigBase fields.
- The Python side adds parallel JIT via registry.build(max_workers) and
heuristic registry.select().
- Includes 7 C++ and 6 Python examples covering all directions with CPU
reference validation, plus shared infrastructure improvements
(BaseRegistry CRTP, structured exceptions).

As a sanity check, JIT compile time for a single kernel remains the
same, and multiple kernels build with better parallelism:

| Kernels | 1 worker | 8 workers |
|---------|----------|-----------|
| 1 | 7.7 s | 7.7 s |
| 2 | 15.9 s | 8.2 s |
| 4 | 33.4 s | 9.7 s |
| 6 | 52.3 s | 10.2 s |
## Test Plan
145 ephemeral unit tests have been added to test basic functionality.
All 30 examples/integration tests run end-to-end on gfx950 (MI350): 7
C++ conv, 7 C++ GEMM, 6 Python conv, 10 Python GEMM. CPU reference
validation for forward, backward-data, and backward-weight (2D) in both
C++ and Python examples pass.
## Test Result
30 examples pass. Peak performance: 132 TFLOPS (Batch-32 forward 56x56),
53 TFLOPS (pointwise 1x1). CPU reference accuracy: max_abs_diff < 0.002
for all directions (fp16 vs fp32 reference).
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Optimize FMHA head-dim padded path on gfx11/gfx12
(#6156)
## Motivation
On gfx11/gfx12, FMHA forward kernels that require head-dim padding show
a large performance drop compared to the exact-head-dim path. In
practice, padded cases such as `HDIM=72` and `HDIM=80` were falling too
far off the fast path.
This PR improves padded-head-dim FMHA performance on gfx11/gfx12 while
keeping the behavior for other GPUs unchanged.
## Technical Details
- Add/scope a dedicated padded-head-dim (`qr_hpad`) FMHA forward path
for gfx11/gfx12.
- For `receipt=0`, keep support conservative and only enable the padded
fast path for vector-safe cases (`head_dim % 8 == 0`), matching the
existing assumption used on other GPUs.
- Move `v_prefetch` later only for the head-dim-padded path on
gfx11/gfx12. This reduces live ranges and removes the register-spill
behavior seen in the earlier scheduling.
- Enable the buffer-load OOB check offset trick for the padded path on
gfx11/gfx12.
## Test Plan
```
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
```
## Test Result
Observed padded-head-dim performance improvements for HDIM=72/80:
- gfx11: ~3.5x
- gfx1151: ~2.0x
- gfx12: ~1.3x
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] [CK Tile] Improved ci_safety_check in smart-build
infrastructure (#6215)
## Motivation
The two-dot syntax (`origin/develop..HEAD`) is more conservative and
catches a broader set of changes when PRs merge the develop branch.
While three-dot syntax shows only PR-specific changes, two-dot syntax
ensures we don't miss any files that differ between develop and the PR
branch, including files modified in both the PR and merged develop
commits. This conservative approach prioritizes catching all potential
issues over CI efficiency, which is appropriate for build-system change
detection.
## Technical Details
- Switched to two-dot (`..`) syntax in ci_safety_check.sh
- Updated comments to clarify the intentional use of two-dot syntax
- Maintained consistency across both CHANGE_ID branches
- Trigger a full build when any of the following change:
  - `Dockerfile|Jenkinsfile|CMakePresets\.json|script/dependency-parser/`
## Test Plan
Tested with PR 6200, which has multiple merge commits.
## Test Result
It detects 43 more tests than the three-dot scheme.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Temporarily remove batch prefill KV cache overflow
asserts (#6201)
## Summary
- Temporarily remove the KV cache offset overflow assert checks in
`FmhaBatchPrefillWithPagedKVCacheKernel`
- The asserts are **correct**, but they block project progress in
certain configurations
- This is a **temporary workaround** to unblock progress; a proper fix
will follow
## Note
This is NOT a permanent solution. A follow-up PR will add proper
overflow handling that addresses the underlying issue without blocking
progress.
[CK][CK_Tile] Ensure CK Tile engine benchmarking targets are
excluded from default build. (#6135)
## Motivation
Ensure that tile engine benchmarking targets do not build by default and
slow down other developers' builds.
## Technical Details
- Added EXCLUDE_FROM_ALL to all add_subdirectory calls in
tile_engine/CMakeLists.txt and ops/gemm/CMakeLists.txt, so none of the
tile engine ops targets are part of the default all build.
- Added missing EXCLUDE_FROM_ALL to add_executable in
ops/pooling/CMakeLists.txt and ops/reduce/CMakeLists.txt (the GEMM
variants already had it).
- Downgraded message(STATUS ...) to message(VERBOSE ...) (or DEBUG for
per-target creation) in ops/pooling/, ops/gemm_streamk/, and ops/reduce/
CMakeLists. The other four GEMM variants (gemm_universal, gemm_multi_d,
gemm_preshuffle, grouped_gemm) already used VERBOSE.
- Targets can still be built on demand via their aggregate names (e.g.
make benchmark_pooling_all, make benchmark_gemm_streamk_all).
## Test Plan
Tile engine benchmark testing stage should be unaffected.
## Test Result
N/A
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Add missing gfx1033 to gfx103 group definition in ck
## Motivation
Resolving PyTorch build failures when enabling builds for gfx103X-all
family in TheRock. https://github.com/ROCm/TheRock/pull/3763. `gfx1033`
is the only failing architecture in the family and the failures point to
missing support in CK.
## Technical Details
PyTorch build fails with repeated error message
```
/__w/TheRock/TheRock/external-builds/pytorch/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:33:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
33 | wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
`gfx1033` is missing from the `__gfx103__` group, which results in
`CK_BUFFER_RESOURCE_3RD_DWORD` never being defined for it. Adding
`gfx1033` to the files where it is missing should be the minimum fix to
allow torch builds to pass.
## Test Plan
Compile sample test file and target gfx1033
```
...
#ifdef __HIP_DEVICE_COMPILE__
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == 0x31014000, "wrong device value");
#else
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == -1, "wrong host value");
#endif
```
## Test Result
Prior to applying the patch, compilation fails with `error: use of
undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'`.
After applying the patch, the test file compiles successfully.
## Submission Checklist
- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Replace daily CI builds with mainline compiler with
TheRock compiler. (#6147)
## Motivation
Since the compiler team has deprecated the amd-mainline branch and
switched to TheRock, we'll start building a docker image with TheRock
artifacts and building/testing CK with that.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Fix architecture-dependent EightWave assignment in
cshuffle_epilogue (#6102)
## Motivation
Tile engine CI build on the develop branch started failing after a
recent change (https://github.com/ROCm/rocm-libraries/pull/5218) in
`cshuffle_epilogue.hpp`. The `EightWave` constant was unconditionally
computed as `(MWave * NWave == 8)` for all architectures, but this logic
is only valid for gfx9*. On other architectures (e.g., gfx1201),
`EightWave` must always be `false`, otherwise it leads to incorrect
`BlockedXDLN_PerWarp` computation and build failures.
## Technical Details
In `cshuffle_epilogue.hpp`, the `EightWave` static constexpr was set as:
```cpp
static constexpr bool EightWave = (MWave * NWave == 8);
```
This was applied regardless of the target GPU architecture. The fix uses
a preprocessor guard to make this architecture-aware:
- **gfx9* (`__gfx9__`):** `EightWave` is evaluated as `(MWave * NWave ==
8)` — true or false depending on the wave configuration
- **All other architectures:** `EightWave` defaults to `false`
## Test Plan
- Tile engine CI build on develop branch
## Test Result
- *Pending CI*
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Restrict FMHA codegen to the kernel subset used by
FlashAttention (#6038)
## Motivation
Currently, the CK FlashAttention integration generates a broader FMHA
kernel set than the FlashAttention wrappers can actually dispatch, which
increases compile time without improving runtime coverage.
## Technical Details
The FlashAttention CK wrappers do not use all logits/LSE variants
emitted by the default FMHA codegen. The direct `fmha_fwd` path always
uses softcap-disabled, LSE-enabled kernels, and the `fmha_fwd_splitkv`
path only uses softcap-disabled kernels. This change trims codegen to
that subset and stops generating the unused logits/LSE variants.
This reduces the generated forward kernel set without changing
`fmha_fwd_appendkv` or `fmha_bwd`. The reduced kernel set was validated
by building and running the
[FlashAttention](https://github.com/Dao-AILab/flash-attention) CK
backend.
Across targets, the total generated FMHA kernel count is reduced by:
- `gfx942`: 29.3%
- `gfx1100`: 33.7%
- `gfx1201`: 31.3%
## Test Plan
pytest test/test_flash_attn_ck.py from
https://github.com/Dao-AILab/flash-attention
## Test Result
All tests passed.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Stream-K gtest Code Gen
## Motivation
Stream-K was using the tile engine infrastructure for smoke tests.
However, tile engine creates a different target per kernel instance,
which has resulted in scalability issues when used in the context of
unit tests. To avoid burdening CMake configuration and build time, we
have opted to remove our Stream-K tile engine tests. Instead, we use
pure gtests with code gen to generate the repetitive .cpp files.
**Note: This appears to change a lot of files because many files are
removed since they are now generated at build time.**
## Technical Details
We originally used Tile Engine to facilitate code gen for unit tests
since we found that pure gtests required the addition of many repetitive
.cpp files of the following form:
```cpp
#include "test_gemm_streamk_common_includes.hpp"
template <typename Tuple>
class TestCkTileStreamKBf8 : public TestCkTileStreamK<Tuple>
{
};
#define TEST_SUITE_NAME TestCkTileStreamKBf8
TYPED_TEST_SUITE(TestCkTileStreamKBf8, KernelTypesStreamKBf8);
#include "test_gemm_streamk_atomic_cases.inc"
#undef TEST_SUITE_NAME
```
Due to issues encountered with tile engine, we instead use pure gtests
to generate the repetitive .cpp files. The code generator parses
`KernelTypesStreamK*` type aliases from the types header using a
two-phase approach:
1. At **configure time**, CMake runs the Python script with
`--list_files` to extract the type alias names from the header
(test_gemm_streamk_types.hpp) and compute the list of .cpp file paths
that will be generated. This lets CMake know the exact set of source
files for each target.
2. At **build time**, `add_custom_command` runs the script again with
`--gen_files` to actually emit the .cpp files into the build directory,
triggered only when the types header or generator script changes.
With these changes, we've removed all Stream-K tile engine tests. There
are now 5 targets for Stream-K GEMM tests:
1. test_ck_tile_streamk_atomic_smoke: smoke tests for Atomic reduction
strategy (pipeline: compv3)
2. test_ck_tile_streamk_linear_smoke: smoke tests for Linear reduction
strategy (pipeline: compv3)
3. test_ck_tile_streamk_tree_smoke: smoke tests for Tree reduction
strategy (pipeline: compv3)
4. test_ck_tile_streamk_pipelines_smoke: smoke tests (smaller set) for
pipelines other than compv3
- Since Stream-K can be thought of as a wrapper around universal GEMM,
we don't need to extensively test each pipeline. So, we opt to run a few
tests for different pipelines. Currently, this just consists of the mem
pipeline, but compv4 is coming soon.
5. test_ck_tile_streamk_extended: extended tests
## Test Plan
I have tested the gtests locally on gfx90a, gfx942, and gfx950.
## Test Result
All local tests pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.