[CK] Fix out of bounds modifications caused by negative
topk_ids in MoeSortingMultiPhaseKernel_P0_v1 (#6242)
## Motivation
Fix random sglang crashes by filtering out negative topk ids.
## Technical Details
In sglang expert-parallel mode, an idle batch (batch=0) may be fired; in CUDA graph mode it reuses the batch=1 resources. The topk op sets unused topk ids to -1, so in the idle-batch case all topk ids are set to -1. In `MoeSortingMultiPhaseKernel_P0_v1`, a negative expert id causes an out-of-bounds write, and sglang may randomly crash.
Beyond the idle-batch case, expert ids of -1 can also appear through the same logic when the captured batch sizes are discrete.
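A minimal sketch of the guard this fix implies (hypothetical kernel and buffer names, not the actual CK kernel): negative topk ids are filtered before being used as indices, so an idle batch whose ids are all -1 performs no writes.
```cpp
#include <hip/hip_runtime.h>

// Hypothetical illustration; the real MoeSortingMultiPhaseKernel_P0_v1 does
// more than count, but the guard has the same shape.
__global__ void count_expert_ids(const int* topk_ids, int* expert_counts, int num_ids)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i >= num_ids)
        return;
    const int expert = topk_ids[i];
    if(expert < 0) // -1 marks an unused slot; indexing with it writes out of bounds
        return;
    atomicAdd(&expert_counts[expert], 1);
}
```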
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: zovonoir <jialzhu@amd.com>
Fix per-layer conv2d int8 CPU verification reference path
(#6656)
Failing case: `example_conv2d_fwd_xdl_perlayer_quantization_int8.exe 1 0`
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Add missing LDS sync
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Grouped Convolution Backward Data Direct Load
(#6624)
## Proposed changes
Add Grouped Convolution Backward Data with Direct Load to the
DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffleV3 device implementation.
This enables direct global memory loading (bypassing LDS) for the
backward data convolution path on gfx950, following the same pattern
used in both backward weight and forward convolution.
Direct-load convolution backward data improves performance by avoiding
LDS round-trips for certain configurations on gfx950, which supports a
wider range of instructions. Correctness is currently checked only at
the usage point; it should be extended to a standalone unit test in the future.
Fix ck4inductor conv instance parsing for NumGroupsToMerge
parameter (#6434)
## Summary
- Add `num_groups_to_merge` field to `CKGroupedConvFwdOp` dataclass to
match the new (#4273) `NumGroupsToMerge` template parameter added to
`DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3`
- Enable inductor tests by default in Jenkins CI
## Test plan
- [x] Built wheel without patch: `test_gen_conv_instances` fails with
`TypeError: takes from 47 to 50 positional arguments but 51 were given`
- [x] Built wheel with patch: `test_gen_conv_instances` passes
[CK_TILE] Preserve input strides in EightWaves async-load
descriptor (#6611)
`MakeAsyncLoadADramWindow` in
`GemmPipelineAgBgCrCompAsyncEightWavesPolicy` was rebuilding the 6D view
descriptor with `make_naive_tensor_descriptor_packed`, which synthesizes
strides from lengths and assumes a dense layout. When the input view's
leading-dim stride is larger than its inner length (non-packed memory
layout), the resulting tile window stepped through memory at the wrong
stride.
Compose the unmerge transforms on top of the input view's existing
descriptor instead, so the actual runtime strides are preserved and the
correct `element_space_size` is inherited for bounds checking.
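A standalone sketch of the bug class (plain C++, not the CK descriptor API): strides synthesized from lengths diverge from the real strides whenever the layout is not dense.
```cpp
#include <cstdio>

// A 4x6 view living in rows of stride 8 (non-packed). A "packed" descriptor
// would synthesize row stride 6 from the lengths alone and step through
// memory at the wrong offsets.
int main()
{
    const int rows = 4, cols = 6;
    const int real_row_stride   = 8;    // actual leading-dim stride of the view
    const int packed_row_stride = cols; // what a packed rebuild assumes
    for(int r = 0; r < rows; ++r)
        std::printf("row %d: real offset %2d, packed (wrong) offset %2d\n",
                    r, r * real_row_stride, r * packed_row_stride);
    return 0;
}
```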
## Test Plan
Added a unit test showing the problem.
## Test Result
The new test fails before the fix and passes after.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Add conv bwd data tests
## Motivation
This PR adds tests for CK Tile's convolution backward data operation to
enable functionality regression tracking and error detection.
## Technical Details
Currently only the NHWGC/GKCYX/NHWGK and NDHWGC/GKCZYX/NDHWGK (2D and 3D
channel-last) layouts are tested, since only they are implemented in CK
Tile. The current tests cover the FP16, BF16 and FP32 datatypes and
various convolution scenarios. The tested instances are listed in the
`experimental/grouped_convolution_tile_instances` directory.
## Test Result
All implemented tests are working properly and passing.
[CK] Fix for hipblaslt error in PyTorch Dockerfile
## Motivation
This PR fixes the hipblaslt client build failures that occur when
building the PyTorch Docker image, which are currently causing failures
in CI.
## Technical Details
- Correctly reset the working directory to tmp/
- Added --use-system-packages to install.sh to use the system-installed
LAPACK packages, as the hard-coded paths were not being built.
## Test Plan
Locally built the Docker image using the Dockerfile.
## Test Result
Image was successfully built.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[CK][fmha] Add StreamLLM sink support to batch_prefill
pipeline (#6479)
## Motivation
The existing paged-KV attention pipelines (pagedkv, splitkv) support
StreamLLM-style sink tokens — a fixed set of initial tokens kept in
attention alongside the sliding window. The `batch_prefill` pipeline
(chunked-prefill with VLLM-style block tables) previously hardcoded
`kHasSink = false`, making it incompatible with sink-based attention
patterns in LLM serving scenarios.
This PR extends `batch_prefill` to support `kHasSink` and wires it
into `fmha_fwd_runner` for validation against the existing CPU
reference.
## Technical Details
**Pipeline** (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`):
- When `kHasSink`, the K/V loop splits into a sink phase [0, sink_seq_end) and a window phase [seqlen_k_start, seqlen_k_end), mirroring pagedkv (see the sketch after this list).
- K advance at the sink→window transition jumps `seqlen_k_start - sink_seq_end + kN0` to bridge the gap.
- V scatter-gather offsets are re-initialized at the transition to fix a window-mismatch bug: V was lagging kN0 behind K after the large jump, loading from the wrong sequence position.
- Bias window, dropout seq_offset, and mask type (LogitsSinkMask) are updated for sink-awareness.
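The two-phase structure, reduced to its control flow (a hedged sketch with hypothetical names; the real pipeline advances tile windows rather than a plain index):
```cpp
void process_kv_tile(int k0); // stand-in for one kN0-wide K/V tile step

void iterate_kv(int sink_seq_end, int seqlen_k_start, int seqlen_k_end, int kN0)
{
    for(int k0 = 0; k0 < sink_seq_end; k0 += kN0)
        process_kv_tile(k0); // sink phase: [0, sink_seq_end)
    // From the last sink tile, a single K advance of
    // (seqlen_k_start - sink_seq_end + kN0) lands at the window start.
    // V scatter-gather offsets must be re-initialized here too, or V lags
    // kN0 behind K (the window-mismatch bug fixed by this PR).
    for(int k0 = seqlen_k_start; k0 < seqlen_k_end; k0 += kN0)
        process_kv_tile(k0); // window phase: [seqlen_k_start, seqlen_k_end)
}
```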
**Traits / codegen** (`tile_fmha_traits.hpp`, `fmha_fwd.hpp`,
`fmha_batch_prefill.py`):
- `TileFmhaBatchPrefillTraits` gains `kHasSink_` (was hardcoded
`false`).
- Codegen adds `F_sink` field; skips batch-mode kernels (group mode
required).
- CMake test filter broadened from 9 → 33 instances covering
fp16/bf16 × mask/nmask × lse/nlse × sink/nsink.
**Runner** (`fmha_fwd_runner.hpp`, `CMakeLists.txt`):
- `fmha_batch_prefill()` dispatched from `run_fwd` when:
group mode + paged KV + num_splits == 1.
- K/V strides corrected for runner's [num_pages, nhead_k,
page_block_size, hdim] layout.
- `page_block_size % 128` check relaxed: batch_prefill supports ps=16.
- CPU reference paged-KV reordering guards extended with
`CK_TILE_FMHA_FWD_BATCH_PREFILL_API`.
## Test Plan
Build with `-DFMHA_FWD_ENABLE_APIS="fwd;batch_prefill"`, run
`tile_example_fmha_fwd` in group mode with page_block_size=16.
Test matrix:
- Mask: no-mask, causal, sliding window
- Sink: nsink, sink=1..128
- dtype: fp16, bf16
- LSE output: on/off
- seqlen ∈ {512,1024,2048,4096} × window ∈ {32,256,512,1024}
- GQA, chunked prefill, large batch×seqlen
- page_block_size: 16, 32
## Test Result
171 test cases, all valid:
- nmask + nsink: ✓
- causal + nsink: ✓
- causal + sink=8: ✓
- sliding window + sink=8 (d=128, d=256): ✓
- bf16, LSE output, GQA: ✓
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
projects/composablekernel: add SwigluStep support for MoE
blockscale (#6118)
## Summary
- add `swiglustep_and_mul` to the composablekernel MoE blockscale
activation enum
- implement the corresponding blockscale epilogue path for `SwigluStep`
- keep existing `silu` and `gelu` paths unchanged
## Scope
This PR covers the classic composablekernel blockscale MoE path under
`projects/composablekernel`.
This is separate from the `ck_tile` / FlatMM path being discussed in
ROCm/rocm-libraries#5992.
## Motivation
`Step-3.5-Flash-FP8` uses `SwigluStep` in its MoE MLP path. The
dependent AITER change needs native support for this activation in the
classic composablekernel MoE blockscale path.
## Validation
- patch is limited to two composablekernel files under
`projects/composablekernel`
- existing `silu` / `gelu` paths are unchanged
- dependent AITER runtime validation exercised the classic CK 2-stage path
with AITER MoE enabled
[CK] Remove code duplications in grouped gemm fixed nk
implementations (#4961)
## Motivation
Different flavours of the grouped GEMM fixed-NK implementations share the
same block-to-tile mapping logic; despite that, the code responsible for
it is duplicated in each device struct implementation.
- Move `BlockToCTileMap_KBatch_M00_N0_M01Adapt_MLoops` and
`OffsettedBlockToCTileMapMLoops` from the device struct implementations
to a common header file.
- Use the generic kernel argument structures in the xdl versions of the
fixed-NK implementations.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
CI in general. Relevant tests and examples are all fixed_nk versions of
grouped GEMM multiple-D and ABD.
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[MIOpen][CK] Fix bwd weight conv test failures by disabling
one block-GEMM V5 instance for 3D convs (#6421)
## Motivation
Due to a compiler version update, there are test failures in the test
target `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current develop branch, the test
failures are blocking any PR merges into develop. An example of a failed
CI run is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as the tests are passing for clang compiler version 20.0 and
failing for clang compiler version 22.0.
First attempt to fix this problem had to be reverted in #6400 because it
broke MIOpen internal DB sync tests.
## Technical Details
The root cause of the test failures is the block-GEMM V5 instances of
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that have a large tile size.
The V5 pipeline uses a double register buffer, which in combination with
the large tile size causes high register pressure. The latest compiler
version handles the register spillage incorrectly for `gfx90a`, which
causes the kernel to output incorrect results.
The BF16/FP16 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3`
that do not use direct load are divided into two groups:
- Base instances
- Instances that result in high register usage (currently only one
instance, the one that causes the test failures)
This division makes it possible to disable only the V5 block-GEMM flavor
of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>` for 3D convolutions on `gfx90a`. The
selective disabling leaves the set of instances for 1D and 2D
convolutions unaffected, and removes at runtime two V5 block-GEMM
instances (`ConvBwdWeightDefault` and
`ConvBwdWeightFilter1x1Stride1Pad0`) per data type (FP16/BF16) when the
device is `gfx90a`.
Because MIOpen uses CK's type string (provided by the method
`GetTypeString`) to identify the instances, the DB sync tests are
expected to be unaffected, since there are still V2 block-GEMM instances
that produce the same type string
(`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>`). This expectation needs to be verified by
running the MIOpen DB sync tests, which are not part of the normal CK PR
build.
## Test Plan
Running all CI tests + the MIOpen internal DB sync tests is sufficient
to verify the correctness of the code changes.
## Test Result
Verified locally that the previously failing tests (e.g.
`TestGroupedConvndBwdWeight3d/4.Test3D`) have instance counts
- 231 on `gfx90a`
- 233 on `gfx942`
and are currently passing. This confirms the expectation that two
instances per data type are disabled on `gfx90a`.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Ville Pietilä <>
[CK][CK_TILE] Fix library caching bug in gemm dispatcher
(#6445)
## Motivation
`setup_gemm_dispatcher()` was rebuilding libraries on every call instead
of reusing cached libraries.
**Root Cause**:
1. Library names only included dtype+layout, causing different
tile/wave/warp configs to overwrite each other
2. No cache checking: the dispatcher always loaded the default library,
detected a mismatch, then rebuilt
## Technical Details
**Solution**:
1. Complete library naming with all distinguishing parameters:
libdispatcher_gemm_{dtype}_{layout}_{tile}_{wave}_{warp}_{pipeline}_{epilogue}_{scheduler}.so
2. Cache checking before rebuild:
- Check if library for exact config already exists
- Reuse if found (500x faster: 0.02s vs 10s)
- Only rebuild when no cached library exists
3. Better error handling for kernel generation failures
**Files Changed:**
- dispatcher/python/ctypes_utils.py
- dispatcher/tests/test_library_caching.py (new unit test)
## Test Plan
Use `dispatcher/tests/test_library_caching.py` to ensure that libraries
are cached and only rebuilt if they are not present in the build directory:
1. **test_01_unique_library_naming** - Library names include all
parameters (dtype, layout, tile, wave, warp, pipeline, epilogue,
scheduler)
2. **test_02_library_build_and_cache** - Libraries are built once and
then cached for reuse
3. **test_03_different_configs_different_libraries** - Different configs
create different library files
4. **test_04_cache_message_verification** - Cache hit messages are
logged correctly
5. **test_05_code_fix_verification** - Code changes are present in
ctypes_utils.py
## Test Result
All the tests above passed.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Remove obsolete benchmark_fwd_v3.sh script and README
reference (#6305)
The tile_example_fmha_fwd_v3 target no longer exists in this project,
making this benchmark script non-functional.
[CK_TILE] Separate PermuteN epilogue from CShuffle epilogue
into standalone file (#5863)
## Motivation
The PermuteN epilogue was previously embedded within
cshuffle_epilogue.hpp, despite having fundamentally different behaviour.
Coupling these two independent strategies in one file introduced
unnecessary complexity, SFINAE guards, and a dual operator() overload
selected at compile time via the TiledMMAPermuteN_ template parameter.
This PR separates PermuteN into its own standalone file
(permuten_epilogue.hpp), simplifying both implementations and
making the codebase easier to maintain and extend independently.
## Technical Details
**New file: permuten_epilogue.hpp:**
contains PermuteNEpilogueProblem and PermuteNEpilogue, extracted from
the permuteN code path in cshuffle_epilogue.hpp.
**Cleanup of cshuffle_epilogue.hpp:**
- Removed the TiledMMAPermuteN_ template parameter from `CShuffleEpilogueProblem`
- Removed the SFINAE-guarded permuteN operator() overload
- Removed the EnablePermuateN_ SFINAE alias
- CShuffle now only contains CShuffle logic; EightWave support
(independent feature) is retained
**Consumer migration:**
All consumer files now use compile-time epilogue selection via `std::conditional_t`:
```cpp
using GemmEpilogue = std::conditional_t<
    TiledMMAPermuteN,
    PermuteNEpilogue<PermuteNEpilogueProblem<...>>,
    CShuffleEpilogue<CShuffleEpilogueProblem<...>>>;
```
**Files modified:**
- flatmm_basic.cpp, moe_flatmm.cpp, a16w4_moe_flatmm.cpp,
mixed_prec_flatmm.cpp, mx_flatmm_instance.hpp — flatmm examples
- run_gemm_quant_example.inc — block-scale GEMM example
- gemm_weight_preshuffle_invoker.hpp — weight preshuffle invoker
- test_gemm_quant_fixtures.hpp, test_gemm_persistent_async_input.cpp,
test_gemm_pipeline_util.hpp — test utilities
- universal_gemm_invoker.hpp — universal GEMM invoker
- epilogue.hpp — updated to include permuten_epilogue.hpp
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Restructure Tile Engine's benchmarking and
profiling (#4769)
## Motivation
This PR restructures the benchmarking and profiling aspects of CK Tile's
Tile Engine, expanding on the groundwork from the previous PR
(https://github.com/ROCm/composable_kernel/pull/3434) and outlined in this [design
document](https://amdcloud-my.sharepoint.com/:w:/r/personal/astharai_amd_com/Documents/Restructuring%20Tile%20Engine.docx?d=w14ea28a30718416988ed5ebb759bd3b2&csf=1&web=1&e=l3VBuX).
In PR 3434, to reduce repeated code we implemented:
- A base class that centralizes common functionality and provides a
default implementation (Universal GEMM)
- Child classes for GEMM variants that override virtual functions to
handle variant-specific behavior
The refactoring in this PR follows the same process; it should greatly
reduce the duplicated code present in Tile Engine and make it simpler to
add new operations, increasing scalability.
## Technical Details
The files have been refactored around new base structs for benchmarks,
profiling and problem descriptions. The new base structs are:
- GemmProblem
- GemmBenchmark
- GemmProfiler
Universal GEMM, Preshuffle GEMM, and Multi-D GEMM all have child classes
that will inherit from these base structs overriding only what differs
per variant.
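A hedged sketch of this inheritance layout (illustrative member functions only; the real base structs carry the full benchmarking/profiling interface):
```cpp
// Base struct provides the default (Universal GEMM) behavior; variants
// override only what differs.
struct GemmProfiler
{
    virtual ~GemmProfiler() = default;
    virtual void run_profile() { /* default Universal GEMM profiling path */ }
};

struct PreshuffleGemmProfiler : GemmProfiler
{
    void run_profile() override { /* preshuffle-specific differences only */ }
};
```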
All common functions across the benchmarking and profiling files have
been moved into newly added common utility files under the commons/
directory. The new utility files are:
- utils.hpp: common functions for the benchmarking and profiling process
- benchmark_utils.py: common utility functions for the benchmark
generation
## Test Plan
I tested using the existing tests for Tile Engine.
## Test Result
All tests passed.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[ck] Clamp negative kernel execution elapsed time to zero
(#6379)
## Motivation
hipEventElapsedTime can return a small negative value on Windows when
timing a very fast kernel launch on the null stream. This caused
consumers of launch_and_time_kernel to receive a negative elapsed time,
which they reasonably treat as an error, breaking otherwise-correct
kernel executions.
## Technical Details
After calling hipEventElapsedTime, a clamp is applied in
launch_and_time_kernel before the result is returned, avoiding the
return of a physically impossible elapsed time.
The negative value from hipEventElapsedTime has been observed on
Windows. For kernels that complete in well under a millisecond, the HIP
event timestamps can alias such that the computed difference is a small
negative number (observed: ~-1.78 ms). No HIP error is reported by any
surrounding call (hipEventRecord, hipEventSynchronize, hipGetLastError),
confirming the kernel itself executed successfully.
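A minimal sketch of the clamp (hypothetical helper; the exact code in launch_and_time_kernel may differ):
```cpp
#include <algorithm>
#include <hip/hip_runtime.h>

// Measure elapsed time between two recorded HIP events and clamp the
// occasionally negative Windows reading to zero before returning it.
inline float clamped_elapsed_ms(hipEvent_t start, hipEvent_t stop)
{
    float elapsed_ms = 0.f;
    (void)hipEventElapsedTime(&elapsed_ms, start, stop);
    return std::max(elapsed_ms, 0.f); // never report a physically impossible time
}
```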
## Test Plan
- Recompile CK and validate no kernel execution reports a negative
elapsed time during hipTensor tests.
- Pass the CI/CD pre-checking tests for CK.
## Test Result
- All tests passing
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Skip fp16 dropout d256 batch tests for compiler VGPR
aliasing bug (#6342)
## Summary
- Skip fp16 FMHA forward dropout tests that use the d256 tile in batch
mode, gated on compiler version
- The AMDGPU compiler miscompiles these kernels due to VGPR aliasing of
Philox RNG parameters under high register pressure (383 VGPRs)
- bf16 dropout tests are unaffected and cover the same code paths
## Root Cause
The compiler aliases `ph_seed` and `ph_head_offset` (Philox RNG state
stored in VGPRs) with other live data during the softmax main loop. This
causes corrupted `buffer_store_byte` writes for dropout randval on wave
lanes 32-63, producing NaN in output and LSE tensors.
**Conditions:** fp16 + d256 tile + dropout + batch mode + `qr` pipeline
+ gfx90a
## Changes
- `include/ck_tile/core/config.hpp`: Add
`CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE` macro
- `test/ck_tile/fmha/test_fmha_fwd.cpp`: Version-gated `GTEST_SKIP` in
`TEST_P(Dropout, ...)` (sketched below)
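A hedged sketch of the version-gated skip (condition names are illustrative, not the actual test code):
```cpp
#include <gtest/gtest.h>

// Stand-in for the config.hpp macro so the sketch is self-contained.
#ifndef CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE
#define CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE 0
#endif

TEST(DropoutWorkaroundSketch, SkipOnAffectedCompiler)
{
    const bool is_fp16 = true, batch_mode = true;
    const int hdim     = 256;
#if CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE
    if(is_fp16 && hdim == 256 && batch_mode)
        GTEST_SKIP() << "fp16 d256 batch dropout: compiler VGPR aliasing bug";
#endif
    SUCCEED(); // the real dropout checks would run here
}
```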
## Test plan
- [x] ROCm 7.1.1 (clang 20): 168/168 fp16 dropout tests PASS (no skip
active)
- [x] ROCm 7.12 (clang 22): 132 PASS, 36 SKIPPED, 0 FAILED
- [x] bf16 dropout tests: 168/168 PASS (unaffected by this change)
[CK][CK TILE] Modify elementwise kernel template signature to
accept independent type arguments (#6399)
## Motivation
Modify the elementwise kernel template signature to fix a CShuffle
epilogue build error.
## Technical Details
Encountered a build error while building conv fallback kernel with
dispatcher.
Error: Type mismatch in `ElementWiseKernel::operator()` where the
template required all three parameters (lens, input_strides,
output_strides) to be the same type, but the CShuffle epilogue was
passing them with different tuple element types.
Solution: modified the template signature in elementwise_kernel.hpp to
accept three independent type parameters:
- Changed the single `typename Dims` to `typename DimsLens`, `typename DimsInStrides`, `typename DimsOutStrides`
- Updated references to `Dims::size()` to use the appropriate specific type (before/after sketched below)
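A simplified before/after sketch of the signature change (declarations only; the real kernel in elementwise_kernel.hpp has more parameters):
```cpp
// Before: one type parameter forced lens and both stride tuples to share a type.
struct ElementWiseKernelBefore
{
    template <typename Dims>
    void operator()(const Dims& lens, const Dims& in_strides, const Dims& out_strides) const;
};

// After: independent type parameters accept tuples with different element types.
struct ElementWiseKernelAfter
{
    template <typename DimsLens, typename DimsInStrides, typename DimsOutStrides>
    void operator()(const DimsLens& lens,
                    const DimsInStrides& in_strides,
                    const DimsOutStrides& out_strides) const;
};
```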
## Test Plan
- Test with dispatcher conv unit tests
- Relying on CI tests
## Test Result
- Dispatcher unit tests passed
- Relying on CI tests
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[MIOPEN] [CK] Revert "[CK] Disable test cases affected by
compiler codegen bugs on gfx90a" (#6400)
Reverts ROCm/rocm-libraries#6343
This is causing failures in MIOpen, namely DB sync on gfx942, even though it shouldn't be affected, so this needs to be investigated. Please add miopen as a label to the new PR addressing the compiler codegen bug so that this can be addressed simultaneously.
[CK] Disable compilation of problematic bwd weight conv
instances for gfx90a (#6343)
## Motivation
Due to a compiler version update, there are test failures in the test
suite `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current `develop` branch, the test
failures are blocking any PR merges into `develop`. An example of a
failed CI run is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as tests are passing for clang compiler version 20.0 and failing
for clang compiler version 22.0.
## Technical Details
This PR disables the compilation of the problematic bwd weight conv
instance for `gfx90a` by adding a new CMake flag `CK_USE_GFX90A` that
allows us to detect when we are compiling for `gfx90a`. Using the new
CMake flag, compilation of instance
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>` is disabled for `gfx90a`.
Co-authored-by: Ville Pietilä <>
CK: Extract shared boilerplate from 47 gemm_quant test files
(#6323)
Depends on #6303
## Summary
Extract shared test boilerplate (includes, type aliases, test fixture
macros) from 47 `test_gemm_quant_*` files into a single
`test_gemm_quant_common.hpp` header. Each test file is reduced from ~50
lines of boilerplate to ~5 lines.
| Metric | Value |
|--------|-------|
| Files changed | 48 |
| Insertions | +413 |
| Deletions | −1,106 |
| **Net lines removed** | **−693** |
### What changed
| Before | After |
|--------|-------|
| 47 test files, each with ~50 lines of identical includes, type
aliases, and fixture macros | 1 shared header
(`test_gemm_quant_common.hpp`) + 47 thin files (~5 lines each: include +
params) |
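A hedged sketch of what one thin file can look like after the extraction (file, alias, and macro names are illustrative, patterned on the description rather than copied from the repo):
```cpp
// test_gemm_quant_rowcol_fp16.cpp (hypothetical file)
#include "test_gemm_quant_common.hpp"

// The only content unique to this file: the config alias and dimensions.
using GemmConfig = GemmQuantConfig<ck::half_t, /*M=*/256, /*N=*/256, /*K=*/64>;
CK_GEMM_QUANT_TEST_SUITE(GemmConfig); // hypothetical fixture macro
```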
### Readability assessment
A code-review pass confirmed this change **improves readability**: the
47 test files had identical boilerplate obscuring the only meaningful
content — the `GemmConfig` type alias and test dimensions. After the
refactoring, each file's unique configuration is immediately visible,
and adding a new test variant requires specifying only the varying
parameters instead of copying 50 lines.
### Cumulative cleanup series stats
| PR | Description | Net lines |
|----|-------------|-----------|
| #6300 | Remove 61 dead `#if 0` blocks | −2,648 |
| #6302 | Remove 41 commented-out dead code blocks | −2,861 |
| #6303 | Remove 4 orphaned files | −3,886 |
| This PR | Extract gemm_quant test boilerplate | −693 |
| **Total** | | **−10,088** |
CK: Remove 4 orphaned files with verified replacements
(~1,025 lines) (#6303)
Depends on #6302
## Summary
Remove 4 orphaned files that have verified replacements already in the
build.
| File | Reason | Replacement |
|------|--------|-------------|
| `test_gemm_pipeline_compiler.cpp` | Refactored into 13 smaller tests |
`_compv3`, `_compv4`, `_mem`, `_persistent`, etc. |
| `test_grouped_gemm_quant.cpp` | Refactored into 5 smaller tests |
`_rowcol`, `_tensor`, `_aquant`, `_bquant`, etc. |
| `..._f8_f8_f16_..._comp_default_instance.cpp` | Superseded by split
files | `_part1.cpp` + `_part2.cpp` |
| `..._f8_f8_f16_..._comp_kpadding_instance.cpp` | Superseded by split
files | `_part1.cpp` + `_part2.cpp` |
Each deletion was verified:
- Original file is NOT in any CMakeLists.txt
- Replacement files ARE in CMakeLists.txt and actively compiled
- Content is fully covered by the replacement files
[CK] Add flash_attn tests
## Motivation
Add CI support for running
[flash-attention](https://github.com/ROCm/flash-attention) tests against
CK, similar to existing AITER and PyTorch downstream test pipelines.
## Technical Details
### New: `Dockerfile.fa`
A new Dockerfile that builds a flash-attention test image on top of a
ROCm PyTorch base image. It:
- Sparse-checkouts CK from `rocm-libraries` (or clones directly from
`ROCm/composable_kernel`)
- Clones and builds `flash-attention` with CK as the backend
- Supports configurable `FA_BRANCH`, `CK_FA_BRANCH`, and `GPU_ARCHS`
build args
### Updated: `Jenkinsfile`
**buildDocker refactor:**
- Extracted `buildAndPushDockerImage()` helper that handles both "check
if exists, skip" and "force build, push" logic, eliminating the
duplicated try/catch blocks
- Split monolithic `buildDocker()` into `buildDockerBase()`,
`buildDockerPytorch()`, `buildDockerAiter()`, and new `buildDockerFa()`
- Each downstream docker build now runs unconditionally within its
respective guard (`RUN_PYTORCH_TESTS`, `RUN_AITER_TESTS`,
`RUN_FA_TESTS`)
- Image digests are stored in env vars (`CK_BASE_IMAGE`,
`CK_PYTORCH_IMAGE`, `CK_AITER_IMAGE`, `CK_FA_IMAGE`) for use in
downstream stages
**run_downstream_tests refactor:**
- Merged `run_aiter_tests()` and `run_pytorch_tests()` into a single
generic `run_downstream_tests(conf)` that accepts `image`,
`timeoutHours`, and `execute_cmds`
- Test commands for each downstream target are declared as top-level
lists (`RUN_PYTORCH_TESTS_CMDS`, `RUN_AITER_TESTS_CMDS`,
`RUN_FA_TESTS_CMDS`)
**Pipeline stages:**
- Merged "Run Pytorch Tests" and "Run AITER Tests" into a single "Run
Downstream Tests" parallel stage
- Added two new FA test stages: "Run FA Tests on gfx942" and "Run FA
Tests on gfx950"
- Added new pipeline parameters: `RUN_FA_TESTS`, `fa_base_docker`,
`fa_branch`, `ck_fa_branch`
- `ck_pytorch_branch` and `ck_aiter_branch` now default to the current
branch instead of hardcoded `develop`
- CRON schedule at 13:00 now also triggers `RUN_FA_TESTS=true`
## Test Plan
- [x] Trigger pipeline manually with `RUN_FA_TESTS=true` on gfx942 and
gfx950 nodes
- [x] Verify existing AITER and PyTorch test stages are unaffected
- [x] Verify `buildAndPushDockerImage` correctly skips rebuild when
image already exists (with `BUILD_DOCKER=false`)
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Correct .readthedocs.yml file path
## Motivation
The Read the Docs config file contains outdated file paths from the
legacy repos. Update and correct all paths.
## Technical Details
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] [CK_Tile] Add GroupConv to Kernel Dispatcher
## Motivation
This PR adds CK Tile group convolution (forward, backward-data,
backward-weight) support to the kernel dispatcher, matching and unifying
with the existing dispatcher GEMM infrastructure in architecture and
usability. The dispatcher provides a unified kernel dispatch system with
both C++ and Python frontends, and until now only supported GEMM
operations. This PR enables framework integrators to use the same
declarative kernel workflow for convolutions as they do for GEMM:
declare kernels, build a registry JIT, select kernels within the
registry at runtime, and dispatch to GPU. Future PRs will include
runtime kernel selection heuristics for autotuning of kernel parameters
based on (problem, hardware arch).
## Technical Details
Grouped convolution support has been added to the CK Tile Dispatcher:
- generated_conv_backend.hpp enables dispatcher.run(in, wei, out, problem) for all 6 conv variants (fwd/bwdd/bwdw x 2D/3D), runtime heuristic kernel selection, and GroupedConvKernelKey with full ConvConfigBase fields.
- The Python side adds parallel JIT via registry.build(max_workers) and heuristic registry.select().
- Includes 7 C++ and 6 Python examples covering all directions with CPU reference validation, and shared infrastructure improvements (BaseRegistry CRTP, structured exceptions).
As a sanity check, JIT compile time for a single kernel remains the same, and for multiple kernels there is better parallelism:

| Kernels | 1 worker | 8 workers |
|---------|----------|-----------|
| 1 | 7.7 s | 7.7 s |
| 2 | 15.9 s | 8.2 s |
| 4 | 33.4 s | 9.7 s |
| 6 | 52.3 s | 10.2 s |
## Test Plan
145 ephemeral unit tests have been added to test basic functionality.
All 30 examples/integration tests run end-to-end on gfx950 (MI350): 7
C++ conv, 7 C++ GEMM, 6 Python conv, 10 Python GEMM. CPU reference
validation for forward, backward-data, and backward-weight (2D) in both
C++ and Python examples pass.
## Test Result
30 examples pass. Peak performance: 132 TFLOPS (Batch-32 forward 56x56),
53 TFLOPS (pointwise 1x1). CPU reference accuracy: max_abs_diff < 0.002
for all directions (fp16 vs fp32 reference).
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Optimize FMHA head-dim padded path on gfx11/gfx12
(#6156)
## Motivation
On gfx11/gfx12, FMHA forward kernels that require head-dim padding show
a large performance drop compared to the exact-head-dim path. In
practice, padded cases such as `HDIM=72` and `HDIM=80` were falling too
far off the fast path.
This PR improves padded-head-dim FMHA performance on gfx11/gfx12 while
keeping the behavior for other GPUs unchanged.
## Technical Details
- Add/scope a dedicated padded-head-dim (`qr_hpad`) FMHA forward path
for gfx11/gfx12.
- For `receipt=0`, keep support conservative and only enable the padded
fast path for vector-safe cases (`head_dim % 8 == 0`), matching the
existing assumption used on other GPUs.
- Move `v_prefetch` later only for the head-dim-padded path on
gfx11/gfx12. This reduces live ranges and removes the register-spill
behavior seen in the earlier scheduling.
- Enable the buffer-load OOB check offset trick for the padded path on
gfx11/gfx12.
## Test Plan
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
## Test Result
Observed padded-head-dim performance improvements for HDIM=72/80:
- gfx11: ~3.5x
- gfx1151: ~2.0x
- gfx12: ~1.3x
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] [CK Tile] Improved ci_safety_check in smart-build
infrastructure (#6215)
## Motivation
The two-dot syntax (origin/develop..HEAD) is more conservative and
catches a broader set of changes when PRs merge the develop branch.
While the three-dot syntax shows only PR-specific changes, two-dot
ensures we don't miss any files that differ between develop and the PR
branch, including files modified in both the PR and merged develop
commits.
This conservative approach prioritizes catching all potential issues
over CI efficiency, which is appropriate for build system change
detection.
## Technical Details
- Switched to two-dot (..) syntax in ci_safety_check.sh
- Updated comments to clarify the intentional use of two-dot syntax
- Maintained consistency across both CHANGE_ID branches
- Trigger a full build when any of the following change:
`Dockerfile|Jenkinsfile|CMakePresets\.json|script/dependency-parser/`
## Test Plan
Tested with PR 6200 which has multiple merge-commits.
## Test Result
It detects 43 new tests compared to the three-dot scheme.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Temporarily remove batch prefill KV cache overflow
asserts (#6201)
## Summary
- Temporarily remove the KV cache offset overflow assert checks in
`FmhaBatchPrefillWithPagedKVCacheKernel`
- The asserts are **correct**, but they block project progress in
certain configurations
- This is a **temporary workaround** to unblock progress; a proper fix
will follow
## Note
This is NOT a permanent solution. A follow-up PR will add proper
overflow handling that addresses the underlying issue without blocking
progress.
[CK][CK_Tile] Ensure CK Tile engine benchmarking targets are
excluded from default build. (#6135)
## Motivation
Ensure that tile engine benchmarking targets do not build by default and
slow down other developers.
## Technical Details
- Added EXCLUDE_FROM_ALL to all add_subdirectory calls in
tile_engine/CMakeLists.txt and ops/gemm/CMakeLists.txt, so none of the
tile engine ops targets are part of the default all build.
- Added missing EXCLUDE_FROM_ALL to add_executable in
ops/pooling/CMakeLists.txt and ops/reduce/CMakeLists.txt (the GEMM
variants already had it).
- Downgraded message(STATUS ...) to message(VERBOSE ...) (or DEBUG for
per-target creation) in ops/pooling/, ops/gemm_streamk/, and ops/reduce/
CMakeLists. The other four GEMM variants (gemm_universal, gemm_multi_d,
gemm_preshuffle, grouped_gemm) already used VERBOSE.
- Targets can still be built on demand via their aggregate names (e.g.
make benchmark_pooling_all, make benchmark_gemm_streamk_all).
## Test Plan
Tile engine benchmark testing stage should be unaffected.
## Test Result
N/A
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Add missing gfx1033 to gfx103 group definition in ck
## Motivation
Resolving PyTorch build failures when enabling builds for gfx103X-all
family in TheRock. https://github.com/ROCm/TheRock/pull/3763. `gfx1033`
is the only failing architecture in the family and the failures point to
missing support in CK.
## Technical Details
PyTorch build fails with repeated error message
```
/__w/TheRock/TheRock/external-builds/pytorch/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:33:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
33 | wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
`gfx1033` is missing from the `__gfx103__` group, which results in
`CK_BUFFER_RESOURCE_3RD_DWORD` never being defined for it. Adding
`gfx1033` to the files where it is missing should be the minimal fix to
allow torch builds to pass.
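The group macro follows a chain-of-defines pattern; a simplified sketch (the arch list here is illustrative, the actual CK header enumerates the full gfx103x family):
```cpp
// Define the family group if any member architecture is being compiled.
#if defined(__gfx1030__) || defined(__gfx1031__) || defined(__gfx1032__) || \
    defined(__gfx1033__) /* previously missing */ || defined(__gfx1034__) || \
    defined(__gfx1035__) || defined(__gfx1036__)
#define __gfx103__
#endif
```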
## Test Plan
Compile sample test file and target gfx1033
```
...
#ifdef __HIP_DEVICE_COMPILE__
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == 0x31014000, "wrong device value");
#else
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == -1, "wrong host value");
#endif
```
## Test Result
Prior to applying the patch, compilation fails with `error: use of
undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'`.
After applying the patch, the test file compiles successfully.
## Submission Checklist
- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Replace the mainline compiler with the TheRock compiler in daily
CI builds. (#6147)
## Motivation
Since the compiler team has deprecated the amd-mainline branch and
switched to TheRock, we'll start building a docker image with TheRock
artifacts and building/testing CK with that.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Fix architecture-dependent EightWave assignment in
cshuffle_epilogue (#6102)
## Motivation
Tile engine CI build on the develop branch started failing after a
recent change (https://github.com/ROCm/rocm-libraries/pull/5218) in
`cshuffle_epilogue.hpp`. The `EightWave` constant was unconditionally
computed as `(MWave * NWave == 8)` for all architectures, but this logic
is only valid for gfx9*. On other architectures (e.g., gfx1201),
`EightWave` must always be `false`, otherwise it leads to incorrect
`BlockedXDLN_PerWarp` computation and build failures.
## Technical Details
In `cshuffle_epilogue.hpp`, the `EightWave` static constexpr was set as:
```cpp
static constexpr bool EightWave = (MWave * NWave == 8);
```
This was applied regardless of the target GPU architecture. The fix uses
a preprocessor guard to make this architecture-aware:
- **gfx9* (`__gfx9__`):** `EightWave` is evaluated as `(MWave * NWave ==
8)` — true or false depending on the wave configuration
- **All other architectures:** `EightWave` defaults to `false` (sketched below)
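A sketch of the architecture-aware guard (the actual change may differ in placement):
```cpp
#if defined(__gfx9__)
static constexpr bool EightWave = (MWave * NWave == 8); // valid only on gfx9*
#else
static constexpr bool EightWave = false; // e.g. gfx1201 must not enable EightWave
#endif
```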
## Test Plan
- Tile engine CI build on develop branch
## Test Result
- *Pending CI*
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Restrict FMHA codegen to the kernel subset used by
FlashAttention (#6038)
## Motivation
Currently, the CK FlashAttention integration generates a broader FMHA
kernel set than the FlashAttention wrappers can actually dispatch, which
increases compile time without improving runtime coverage.
## Technical Details
The FlashAttention CK wrappers do not use all logits/LSE variants
emitted by the default FMHA codegen. The direct `fmha_fwd` path always
uses softcap-disabled, LSE-enabled kernels, and the `fmha_fwd_splitkv`
path only uses softcap-disabled kernels. This change trims codegen to
that subset and stops generating the unused logits/LSE variants.
This reduces the generated forward kernel set without changing
`fmha_fwd_appendkv` or `fmha_bwd`. The reduced kernel set was validated
by building and running the
[FlashAttention](https://github.com/Dao-AILab/flash-attention) CK
backend.
Across targets, the total generated FMHA kernel count is reduced by:
- `gfx942`: 29.3%
- `gfx1100`: 33.7%
- `gfx1201`: 31.3%
## Test Plan
pytest test/test_flash_attn_ck.py from
https://github.com/Dao-AILab/flash-attention
## Test Result
All tests passed.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Stream-K gtest Code Gen
## Motivation
Stream-K was using the tile engine infrastructure for smoke tests.
However, tile engine creates a different target per kernel instance,
which has resulted in scalability issues when used in the context of
unit tests. To avoid burdens on cmake configuration and build time, we
have opted to remove our Stream-K tile engine tests. Instead, we use
pure gtests with code gen to generate repetitive .cpp files.
**Note: This appears to change a lot of files because many files are
removed since they are now generated at build time.**
## Technical Details
We originally used Tile Engine to facilitate code gen for unit tests
since we found that pure gtests required the addition of many repetitive
.cpp files of the following form:
```cpp
#include "test_gemm_streamk_common_includes.hpp"
template <typename Tuple>
class TestCkTileStreamKBf8 : public TestCkTileStreamK<Tuple>
{
};
#define TEST_SUITE_NAME TestCkTileStreamKBf8
TYPED_TEST_SUITE(TestCkTileStreamKBf8, KernelTypesStreamKBf8);
#include "test_gemm_streamk_atomic_cases.inc"
#undef TEST_SUITE_NAME
```
Due to issues encountered with tile engine, we instead use pure gtests
to generate the repetitive .cpp files. The code generator parses
`KernelTypesStreamK*` type aliases from the types header using a
two-phase approach:
1. At **configure time**, CMake runs the Python script with
`--list_files` to extract the type alias names from the header
(test_gemm_streamk_types.hpp) and compute the list of .cpp file paths
that will be generated. This lets CMake know the exact set of source
files for each target.
2. At **build time**, `add_custom_command` runs the script again with
`--gen_files` to actually emit the .cpp files into the build directory,
triggered only when the types header or generator script changes.
With these changes, we've removed all Stream-K tile engine tests. There
are now 5 targets for Stream-K GEMM tests:
1. test_ck_tile_streamk_atomic_smoke: smoke tests for Atomic reduction
strategy (pipeline: compv3)
2. test_ck_tile_streamk_linear_smoke: smoke tests for Linear reduction
strategy (pipeline: compv3)
3. test_ck_tile_streamk_tree_smoke: smoke tests for Tree reduction
strategy (pipeline: compv3)
4. test_ck_tile_streamk_pipelines_smoke: smoke tests (smaller set) for
pipelines other than compv3
- Since Stream-K can be thought of as a wrapper around universal GEMM,
we don't need to extensively test each pipeline. So, we opt to run a few
tests for different pipelines. Currently, this just consists of the mem
pipeline, but compv4 is coming soon.
5. test_ck_tile_streamk_extended: extended tests
## Test Plan
I tested the gtests locally on gfx90a, gfx942, and gfx950.
## Test Result
All local tests pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Stream-K Tile Engine Fixes
## Motivation
Stream-K GEMM in Tile Engine was unable to support instances whose
matrix dimensions were not perfectly aligned, due to bugs in the padding
support. This PR restores padding support in the Stream-K implementation
in Tile Engine, along with other minor fixes.
Additionally, this PR introduces a benchmarking script that is standard
for Tile Engine to run all compiled instances with user specified matrix
dimensions.
## Technical Details
- Fixed padding boolean comparison and parsing in gen_single so that
padding flags from the config files are correctly propagated into the
Stream-K template
- Updated trait combo parsing to have the reduction_strategy for
Stream-K in the correct order
- Addition of gemm_streamk_benchmark.py to run all compiled Stream-K
instances
## Test Plan
Tested using the benchmark scripts to run instances.
## Test Result
All instances passed.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Use ck_pytorch docker from private repo.
## Motivation
Move the pytorch docker image used for CK testing into private repo.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Add sink token gradient support in FMHA backward
pass (#5504)
## Motivation
Adds sink token support to the FMHA backward kernel (dot_do_o pipeline):
## Technical Details
- Extend BlockFmhaBwdOGradDotOPipelineProblem with LSEDataType
- Add sink_ptr/d_sink_ptr/lse_ptr/nhead to FmhaBwdOGradDotOCommonKargs
- Compute per-head sink gradient via atomic accumulation in the pipeline
- Update example runner with reference validation for sink gradient
## Test Plan
Add new test case
## Test Result
WIP
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Fix async pivot mismatch in persistent GEMM kernel
scheduler (#5776)
## Motivation
Fix pivot mismatch in the persistent GEMM kernel's async input scheduler
that causes **GPU hangs** and incorrect results when used with AsyncTP
(Asynchronous Tensor Parallelism) on ROCm.
PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM
kernel with chunk signals to overlap communication and computation. The
pivot mechanism ensures each rank starts computing from its own local
shard first (which is already available), then moves to remote chunks as
they arrive over the network.
Because of the pivot mismatch, the kernel frequently waits on signals
for chunks that have not yet arrived, while attempting to read data from
completely different chunks. This synchronization desync reliably
triggers infinite hangs during multi-GPU native AsyncTP execution. This
fix is required to enable functional AsyncTP support on ROCm.
## Technical Details
In the persistent kernel loop (`UniversalGemmKernel::operator()`), the
M-tile coordinate used for data selection (`i_m`) and the M-tile
coordinate used for the chunk-signal wait (`chunk_idx`) were derived
from inconsistent bases:
* `i_m` was computed from the **unpivoted** tile index `iM`.
* `chunk_idx` was computed from the **pivoted** expression `(iM +
tile_idx_pivot)`.
This means the kernel could wait for chunk N's signal but then read from
chunk M's memory, or vice versa. The mismatch scales with GPU count:
with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc.
**The Fix:**
Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m`
and `chunk_idx` from it. This guarantees the kernel always waits for the
correct chunk before reading its data.
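A hedged sketch of the derivation (hypothetical names; the real kernel computes these through its tile partitioner):
```cpp
struct TileIndices
{
    int i_m;       // M-tile coordinate used for data selection
    int chunk_idx; // M-tile coordinate used for the chunk-signal wait
};

// One pivoted index feeds both values, so the kernel always waits on the
// same chunk it is about to read.
TileIndices pivoted_indices(
    int iM, int tile_idx_pivot, int num_tiles_m, int m_per_block, int tiles_per_chunk)
{
    const int iM_eff = (iM + tile_idx_pivot) % num_tiles_m; // single pivoted base
    return {iM_eff * m_per_block, iM_eff / tiles_per_chunk};
}
```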
*(Note: Minor cosmetic `clang-format` changes were also pulled in
alongside the fix).*
## Test Plan
1. Build PyTorch with this CK change.
2. Run the specific multi-GPU AsyncTP native test:
`timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest
test/distributed/test_symmetric_memory.py -k
test_fused_all_gather_matmul_native -q -s -x`
## Test Result
Tests verify correct overlapping execution without hangs or accuracy
mismatches when running the AsyncTP native path with non-zero pivots.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.