[CK Tile] Stream-K RDNA Support
## Motivation
Currently, CK Tile Stream-K only supports CDNA architectures. This
change adds Stream-K support on RDNA3/3.5 and RDNA4 architectures.
## Technical Details
Stream-K currently has 3 reduction strategies: 1) atomics, 2) linear,
and 3) tree. The linear and tree reductions require inter-workgroup
communication to a global flags buffer and a global partials buffer. To
ensure cache coherency, we use cache modifiers to skip cache levels that
are not visible to all workgroups. On CDNA architectures, scalar load
and scalar store instructions are available, which we use to read and
write to the flags buffer with appropriate cache skipping modifiers.
However, RDNA architectures do not support scalar store instructions, so
workgroups must use a buffer store instruction to write to flags.
Additionally, cache modifiers differ between CDNA and RDNA; they also
differ between RDNA3 and RDNA4. Given this information, the main changes
are as follows:
- Added RDNA flag signaling: Use buffer store instructions for writing
to global flags buffer
- Add appropriate cache modifiers for reading and writing to flags and
partials:
  - RDNA3 (gfx11): Use `glc | dlc` coherence flags
  - RDNA4 (gfx12): Use `DEVICE` coherence scope
- SFINAE-guarded overloads: Added compile-time dispatch for
`SignalStorePartialDone()` and `WaitStorePartialDone()` based on target
architecture
- RDNA alignment requirements: Increased flags buffer alignment from
128B to 256B due to RDNA cache line size
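
The compile-time dispatch described above can be sketched roughly as follows. This is a minimal illustration only: the architecture tags, the `flag_store_kind` enum, and `signal_store_partial_done_sketch` are hypothetical names, and the real overloads key off the compiler target rather than tag types.

```cpp
#include <type_traits>

// Hypothetical architecture tags used only by this sketch.
struct cdna_tag {};
struct rdna3_tag {};
struct rdna4_tag {};

// Which kind of flag write each path would issue in this sketch.
enum class flag_store_kind
{
    scalar_store,
    buffer_store_glc_dlc,
    buffer_store_device_scope
};

template <typename Arch>
struct is_rdna : std::false_type {};
template <>
struct is_rdna<rdna3_tag> : std::true_type {};
template <>
struct is_rdna<rdna4_tag> : std::true_type {};

// CDNA overload: scalar store instructions are available.
template <typename Arch, std::enable_if_t<!is_rdna<Arch>::value, int> = 0>
constexpr flag_store_kind signal_store_partial_done_sketch()
{
    return flag_store_kind::scalar_store;
}

// RDNA overload: no scalar stores, so a buffer store is used; the cache
// modifier differs between RDNA3 and RDNA4.
template <typename Arch, std::enable_if_t<is_rdna<Arch>::value, int> = 0>
constexpr flag_store_kind signal_store_partial_done_sketch()
{
    return std::is_same<Arch, rdna4_tag>::value ? flag_store_kind::buffer_store_device_scope
                                                : flag_store_kind::buffer_store_glc_dlc;
}
```

Only one overload is viable for any given tag, so the choice is made at compile time with no runtime branch.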
**A note about the `amd_buffer_coherence_enum`:**
- **Problem:** The `amd_buffer_coherence_enum` uses preprocessor
conditionals (`#if defined(__gfx12__)`) to define architecture-specific
values. Template specializations reference enum values from different
architectures (e.g., `glc_dlc` for GFX11). Due to C++ two-phase name
lookup, non-dependent names are resolved during template parsing
regardless of which architecture is being compiled, causing compilation
failures when referenced values do not exist in the active preprocessor
branch.
- **Temporary Solution**: Added compatibility enum values to each
architecture block. For example, I added `glc_dlc` in the `__gfx12__`
block. I will create a ticket to refactor this enum with a design that
has better scalability and tries to avoid the use of preprocessor
conditionals.
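
A reduced sketch of the problem and the temporary workaround, assuming a hypothetical `SKETCH_TARGET_GFX12` macro in place of `__gfx12__` and placeholder numeric values (the real enum has more entries and different values):

```cpp
// Stand-in for amd_buffer_coherence_enum; values are placeholders.
enum class coherence_sketch
{
    default_coherence = 0,
#if defined(SKETCH_TARGET_GFX12)
    device  = 8,
    glc_dlc = device, // compatibility value so gfx11-specific code still parses
#else
    glc     = 1,
    dlc     = 4,
    glc_dlc = 5,
    device  = glc_dlc, // compatibility value so gfx12-specific code still parses
#endif
};

// `coherence_sketch::glc_dlc` does not depend on T, so it is looked up when the
// template is parsed, even if the template is never instantiated for this target.
template <typename T>
constexpr coherence_sketch gfx11_flags_coherence_sketch()
{
    return coherence_sketch::glc_dlc;
}

// Same situation for the gfx12-specific name.
template <typename T>
constexpr coherence_sketch gfx12_flags_coherence_sketch()
{
    return coherence_sketch::device;
}
```

Without the compatibility enumerators, one of the two templates would fail to compile in whichever preprocessor branch lacks the name it references.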
## Test Plan
### Summary
gtests were added to cover the WMMA variants of Stream-K. These tests were
stress tested locally on gfx11 and gfx12.
### More details
This PR makes the following changes/additions to the Stream-K gtests:
- Split tests into MFMA (CDNA) and WMMA (RDNA) variants
- Added 16 WMMA kernel types: FP16/BF16/FP8/BF8 × Linear/Tree reduction
- WMMA uses 16×16×16 wave tiles for RDNA (this is the only tile size
supported on RDNA)
- Fixed RDNA WGP mode: multiply multiProcessorCount by 2 for actual CU
count
- As described in [HIP
documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/doxygen/html/group___global_defs.html#ggacc0acd7b9bda126c6bb3dfd6e2796d7ca3ac50041beb59111a5c76edf03da0898),
when in Workgroup Processor (WGP) mode, the value of
`hipDeviceAttributeMultiprocessorCount` is half the number of CUs, because a
single WGP contains two CUs. The default mode on RDNA is WGP mode, so when
creating (M, N, K) instances for gtests using the CU count, we need to
multiply the reported multiprocessor count by 2 to get the correct CU count.
This is not needed in the kernel host code, because the occupancy ensures
that overall `max_active_wgs` is correct.
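
The adjustment amounts to the following arithmetic. `compute_unit_count_sketch` is a hypothetical helper name, and how the test decides that the device is in WGP mode is left out of this sketch.

```cpp
// multiprocessor_count: value reported for hipDeviceAttributeMultiprocessorCount.
// wgp_mode: true when the device runs in Workgroup Processor mode (the RDNA default).
constexpr int compute_unit_count_sketch(int multiprocessor_count, bool wgp_mode)
{
    // In WGP mode each reported multiprocessor is a WGP containing two CUs.
    return wgp_mode ? 2 * multiprocessor_count : multiprocessor_count;
}
```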
## Test Result
All tests pass locally.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[Ck][CK Tile] Global Load/Store for Large Tensors support
(#6744)
## Motivation
Provide a solution for supporting large tensors across all of CK Tile.
## Technical Details
- add the possibility to use global load
- int64 indexing
## Test Plan
conv fwd tests
## Test Result
passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-913
[CK TILE] Unification Work – Add MFMA specialisations for `tf32_t` (#6768)
## Motivation
This PR adds two specialisations related to `tf32_t`.
## Technical Details
This change treats `tf32_t` as a concrete type rather than an empty
`struct`. It also adds two new specialisations for MFMA dense builtins
and resolves existing circular include issues.
## Test Plan
All the new wrappers were added to the test suite in
test_amdgcn_mma_layout.inc.
## Test Result
Tests should pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Fix latest build issues with staging compiler.
## Motivation
Fixing new warnings with staging compiler.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[ck] Enforce ASCII-only C/C++ sources for hipRTC
compatibility (#7829)
## Summary
CK source files must be compilable via **hipRTC (HIP runtime
compilation)**, whose preprocessor does not accept non-ASCII bytes
anywhere in a translation unit — **including in comments**. Bytes that
are harmless under `hipcc` (em-dashes, smart quotes, multiplication
signs, Greek letters, box-drawing glyphs, etc.) cause hipRTC to fail at
preprocessing time. These regularly leak in via LLM-assisted authoring
or copy/paste from formatted documents and silently break hipRTC paths
that are not exercised by the default `hipcc`-based build matrix.
This PR (a) cleans every existing violation (53 files) and (b) adds a
pre-checkin gate so new violations are rejected before merge.
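
The condition the gate enforces is simply "no byte with the high bit set". A C++ restatement of that check is sketched below; it is not the implementation of `check_ascii_only.sh` (which is a shell script), and `first_non_ascii_byte` is a hypothetical name.

```cpp
#include <cstddef>

// Returns the offset of the first byte >= 0x80 in text[0, size),
// or -1 if every byte is 7-bit ASCII.
constexpr std::ptrdiff_t first_non_ascii_byte(const char* text, std::size_t size)
{
    for(std::size_t i = 0; i < size; ++i)
    {
        if(static_cast<unsigned char>(text[i]) >= 0x80)
        {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

Because the check is byte-based, it flags a UTF-8 em-dash inside a comment just as readily as one inside a string literal, which matches the hipRTC constraint described above.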
## File extensions covered
Both the cleanup scan and the new Jenkins enforcement stage use the same
predicate:
```
*.h *.hpp *.cpp *.h.in *.hpp.in *.cpp.in *.inc *.cl
```
(excluding `*/build/*` and `*/include/rapidjson/*`). This is a strict
superset of the existing `Clang Format` stage's predicate — `*.inc` is
added so test-fixture include files are also gated. The local pre-commit
hook's `c++/inc` type filter covers the same set.
## Why no enforcement today
CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so
the existing `pre-commit` workflow doesn't touch CK. The local CK
`.pre-commit-config.yaml` only runs for developers who installed hooks.
The **authoritative gate is therefore the new Jenkins stage** in this
PR; the local hook is convenience.
## Commit layout (bisect-friendly)
1. `79798aa6261` — **`[ck] Convert reflect/ rendering to ASCII for
hipRTC compatibility`**
Behavior change, isolated. `TreeFormatter` swaps `├─ / └─ / │ ` for `|-
/ +- / | ` (3-col width preserved so alignment is unchanged).
`conv_description.hpp` swaps `×` for `x` as the dimension separator.
`test_conv_description.cpp` expected strings updated in lockstep so the
snapshot test stays green. This is the only commit in the series with
observable runtime impact.
2. `738fdb0d81c` — **`[ck] Strip non-ASCII bytes from C++ sources for
hipRTC compatibility`**
Mechanical text cleanup across 53 files. Replacements happen in comments
or in `std::cout` strings that are not asserted on by any test. None of
the 174 `.inc` files in the tree required edits, but they were in the
scan's predicate so the enforcement stage's predicate is a superset of
what was scanned. Full replacement table in the commit message.
3. `1d7cd8ba235` — **`[ck] Enforce ASCII-only C/C++ sources for hipRTC
compatibility`**
- New `projects/composablekernel/script/check_ascii_only.sh` (modeled on
`check_copyright_year.sh`).
- New entry in `projects/composablekernel/.pre-commit-config.yaml` under
the local-hooks block (`types_or: [c++, inc]`).
- New `ASCII Only Check` parallel stage in
`projects/composablekernel/Jenkinsfile`'s `Static checks` block,
mirroring the existing `Clang Format` stage but with `*.inc` added to
the find predicate. Always-on, no `RUN_CPPCHECK` gate.
The tree is buildable at every commit boundary. Commit 1 leaves 50 known
violations; commit 2 leaves 0; commit 3 wires the gate.
## Demo
Script output on a synthesized violation:
```
$ printf '// em-dash test \xe2\x80\x94 here\n' > /tmp/bad.cpp
$ projects/composablekernel/script/check_ascii_only.sh /tmp/bad.cpp
ERROR: /tmp/bad.cpp contains non-ASCII bytes:
1:// em-dash test — here
Fix: replace with ASCII (em-dash -> --, smart quotes -> ", arrows -> ->, etc.)
$ echo $?
1
```
Full repo scan after the cleanup commits (note the `-name '*.inc'`
clause):
```
$ cd projects/composablekernel && find . -type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \
-o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \
-not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \
| xargs -0 -P 8 -n 64 script/check_ascii_only.sh
$ echo $?
0
```
## Test plan
- [ ] Jenkins PR build: confirm new `Static checks -> ASCII Only Check`
stage runs green over the full predicate (incl. `*.inc`) and existing
`Clang Format` stage is unaffected.
- [ ] `test_conv_description` passes against the ASCII tree-formatter
output (touched in commit 1).
- [ ] Local: `pre-commit run ascii-only-checker --all-files` runs
cleanly after installing CK pre-commit hooks via
`script/install_precommit.sh`.
- [ ] Manually inject a non-ASCII byte in any `.cpp/.hpp/.inc` file,
push: confirm Jenkins fails the new stage with a clear error.
- [ ] Spot-check a representative subset of touched files under hipRTC
compilation to confirm no remaining hipRTC-blocking content (optional,
since the static byte check is a sufficient condition for hipRTC
preprocessor acceptance on this dimension).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
composablekernel: remove stray *.hpp.bk backup artifacts
(#7974)
Four `*.hpp.bk` files were accidentally committed to
`projects/composablekernel/`, likely as leftovers from a prior merge or
conflict resolution. Each is an older snapshot of its `.hpp` counterpart
— the canonical `.hpp` files are newer and contain the correct current
content.
## Deleted files
| File | vs. `.hpp` counterpart |
|---|---|
| `ck_tile/core/tensor/tile_window.hpp.bk` | Older version: uses legacy `bool isL1Cache`/`PrefetchL1` template params; missing `DataCachePrefetchKind`-based prefetch API and `data_cache_prefetch.hpp` include |
| `ck_tile/core/tensor/load_tile_transpose.hpp.bk` | Older version: missing `#if defined(__gfx950__)` guard and `Quad` struct (~90 lines) for gfx1250 architecture |
| `ck_tile/ops/gemm/warp/warp_gemm_dispatcher.hpp.bk` | Older version: missing `WmmaTag`, `IsScale16` template param, and several newer dispatcher specializations |
| `ck_tile/ops/gemm_quant/block/block_universal_gemm_as_bs_bquant_cr.hpp.bk` | Older version: `KPackA`/`KPackB` (since renamed `KPack`); uses `static_ford` (since refactored to nested `static_for`) |
## Verification
- No other `.bk` files exist in `projects/composablekernel/`.
- No build scripts, CMake files, includes, or documentation reference
these `.bk` files.
- No `.hpp` files were modified.
[CK TILE][Windows] add `msvc::no_unique_address` support for
Windows (#7786)
## Motivation
While building Flash Attention 2 with the CK backend, this warning is
emitted for every kernel:
```
DEBUG [1/1837] hipcc.exe ...
DEBUG In file included from H:\ROCm\flash-attention\build\fmha_fwd_d32_bf16_batch_b64x64x16x32x32x32_r4x1x1_r4x1x1_w16x16x16_w16x16x16_qr_vr_pssk_nlogits_alibi_mask_lse_ndropout_nskip_nqscale_ntrload_nsink_gfx12.cu:6:
DEBUG In file included from H:\ROCm\flash-attention\csrc\composable_kernel\example\ck_tile\01_fmha\fmha_fwd.hpp:6:
DEBUG In file included from H:\ROCm\flash-attention\csrc\composable_kernel\include\ck_tile/core.hpp:111:
DEBUG H:\ROCm\flash-attention\csrc\composable_kernel\include\ck_tile/core/tensor/tile_scatter_gather.hpp:1246:7: warning: unknown attribute 'no_unique_address' ignored [-Wunknown-attributes]
DEBUG 1246 | [[no_unique_address]] std::conditional_t<kUseGlobalLoad_, PageIdxArray, gl_field_empty_t>
DEBUG | ^~~~~~~~~~~~~~~~~
DEBUG H:\ROCm\flash-attention\csrc\composable_kernel\include\ck_tile/core/tensor/tile_scatter_gather.hpp:1254:7: warning: unknown attribute 'no_unique_address' ignored [-Wunknown-attributes]
DEBUG 1254 | [[no_unique_address]] std::conditional_t<kUseGlobalLoad_, index_t, gl_field_empty_t>
DEBUG | ^~~~~~~~~~~~~~~~~
DEBUG 2 warnings generated when compiling for host.
...
```
## Technical Details
`[[no_unique_address]]` is not recognized by LLVM on Windows;
`[[msvc::no_unique_address]]` should be used instead.
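
One common way to select the spelling is a small macro, sketched below. `SKETCH_NO_UNIQUE_ADDRESS`, `empty_field_sketch`, and `holder_sketch` are hypothetical names, and the sketch assumes `_MSC_VER` is defined when clang targets the MSVC ABI; the actual change may detect Windows differently.

```cpp
// Pick the attribute spelling the toolchain is expected to honor.
#if defined(_MSC_VER)
#define SKETCH_NO_UNIQUE_ADDRESS [[msvc::no_unique_address]]
#else
#define SKETCH_NO_UNIQUE_ADDRESS [[no_unique_address]]
#endif

// Empty placeholder type, in the spirit of gl_field_empty_t.
struct empty_field_sketch {};

template <typename Extra>
struct holder_sketch
{
    // Takes no storage when Extra is empty and the attribute is honored.
    SKETCH_NO_UNIQUE_ADDRESS Extra extra;
    int value;
};
```

When the attribute is honored, `holder_sketch<empty_field_sketch>` can be as small as a single `int`; when it is ignored, the empty member still consumes padded storage.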
## Test Plan
Build FA2 with CK backend.
## Test Result
No warnings, no errors.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[CK TILE] Unification Work – More accurate tests for MmaPipelines (#6212)
## Motivation
This PR solves several issues:
#### More accurate tests for MmaPipelines
The current tests for the MmaPipelines (test_amdgcn_sparse_mma,
test_amdgcn_wavewise_mma) use explicit input fragment vectors filled
with 1s, and only check the output of a single lane. We should have
tests that actually use the MmaPipelines with non-trivial input matrices
and verify the complete output.
Some other aspects of the current MmaPipelines tests that I noticed and
deserve some attention:
1. There is sometimes iteration over K outside of the pipeline, which is
then included in WaveTileK or FragK, which is not correct. We should
remove it, move K iteration inside of the pipeline, or be more clear
about this outer-K loop size and how it propagates downwards.
2. There is very tight coupling between the kernel, gtest code, and
test_pipeline helper, requiring a lot of information and functions to be
passed back and forth.
3. The test_pipeline helper is doing a bunch of register-related logic
on the host (related to point 1)
4. Without this register logic the only thing it does is check the
device, call the kernel, and check the output, but with a lot of
boilerplate.
#### Test helper for detecting target arch at HOST runtime
There is a really apparent issue we faced while writing tests:
Scenario:
1. Compile a test that supports both gfx950 and gfx1201 for gfx950
2. Run the test on a server that only has gfx1201 GPU
Actual:
Segmentation fault
Expected:
The test can correctly detect from HOST runtime that the DEVICE
target_id was different and skips the test.
Notes:
The only way to detect the COMPILER_TARGET_ID in the existing "arch"
framework is to launch a kernel and call `get_compiler_target()` (that is,
from DEVICE code). This causes a segmentation fault if the current arch
differs from the target arch. To cope with this issue, we propose to export
the compiler target(s) (note there can be several) through
`projects/composablekernel/test/ck_tile/core/arch/CMakeLists.txt` and
define a test helper to deal with such cases.
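
The host-side decision such a helper has to make can be sketched as below. The function names are hypothetical, the list of compiled targets is assumed to arrive from CMake as plain strings, and querying the device arch name through HIP is deliberately left out so the sketch stays host-only.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Strips feature suffixes such as ":sramecc+:xnack-" from an arch name.
inline std::string base_arch_name_sketch(const std::string& arch_name)
{
    return arch_name.substr(0, arch_name.find(':'));
}

// Returns true when the device arch is one of the targets the test was compiled for.
// A test would skip itself (instead of launching a kernel) when this returns false.
inline bool device_matches_compiled_targets_sketch(const std::vector<std::string>& compiled_targets,
                                                   const std::string& device_arch_name)
{
    assert(!device_arch_name.empty());
    const std::string device = base_arch_name_sketch(device_arch_name);
    return std::any_of(compiled_targets.begin(),
                       compiled_targets.end(),
                       [&](const std::string& target) {
                           return base_arch_name_sketch(target) == device;
                       });
}
```

In the scenario above, a binary compiled only for gfx950 would get `false` on a gfx1201 machine and could skip before any device code runs.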
#### Add composition support to Transforms
We have a small number of Transforms which act on MmaOp input and output
data, before and after the MmaOp call respectively. These are currently
implemented to work on an MmaTile level, but in theory they are also
supposed to work at a WaveTile level, i.e. after composition of multiple
MmaTiles to create larger effective MNK dimensions. Currently the
composed MmaTiles look like 2D C-style arrays of the individual MmaTile
level register vectors (see WaveWiseMmaPipeline). The transforms should
be able to take these and perform the proper transforms to the whole
WaveTile at once. This might allow for better performing
transformations.
Note: This PR handles the SparseTransform case and if we don't end up
doing scale as a transformation, there isn't really much left to do. If
we end up having only the sparse transform as a non-trivial transform,
then we could also consider removing the Transform framework.
[CK] Fix gfx950 AITER Sync Regressions
## Summary
Fixes three gfx950 regressions in the AITER downstream CI that surfaced
after the internal/gfx1250 re-sync (ROCm/rocm-libraries#6978):
> **Companion aiter PR:** ROCm/aiter#3392 — host-side adaptations
(`Kernel::BlockSize()` `constexpr` drops, blockscale `KBatch=1` clamp)
plus the CK submodule bump used to validate these fixes together.
- **FlyDSL MoE AOT cache miss** — the AITER MoE tests run with
`check_aot_cache=True` and fail on any FlyDSL JIT cache miss, but the CI
never pre-compiles the FlyDSL MoE kernels, so gfx950 always misses.
Pre-compile them at the start of the AITER test stage.
- **`buffer.load.lds.v4i32` link error** — ROCm/rocm-libraries#6978
reintroduced a clang-version guard mapping
`llvm.amdgcn.raw.buffer.load.lds` to a `.v4i32`-suffixed name. That name
exists in no LLVM (the rsrc operand is a fixed, non-overloaded `<4 x
i32>`, so the intrinsic is never type-mangled), so gfx950 4-DWORD
direct-to-LDS (e.g. fp4 MoE bpreshuffle) fails to link with `lld:
undefined symbol: llvm.amdgcn.raw.buffer.load.lds.v4i32`. Use the
canonical plain name unconditionally.
- **mixed-precision flatmm warp-GEMM call** — ROCm/rocm-libraries#6978
generalized the scaled `WarpGemmImpl::operator()` from a fixed `<index_t
opselA, index_t opselB>` signature to a variadic `<typename... Params>`
one and updated the `mx_flatmm` pipeline to pass the op-selectors as
`OpSelA<>`/`OpSelB<>` types, but missed the mixed-precision flatmm
pipeline (`F8xMXF4`/`F16xMXF4`), which still passed raw integer
op-selectors. These no longer bind to `typename... Params` (`error: no
matching member function for call to 'operator()'`), breaking
compilation of the fp8/bf16 × fp4 cktile MoE gemm1 instances on gfx950
(aiter `test_moe_2stage`). Wrap the op-selectors in
`OpSelA<>`/`OpSelB<>`.
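
The binding failure in the last item can be reproduced in miniature. In this sketch the `OpSelA`/`OpSelB` definitions, the `find_opsel_*` helpers, and `warp_gemm_sketch` are simplified stand-ins, not the CK Tile implementation; they only show why integer arguments no longer match a `typename...` parameter pack.

```cpp
// Stand-in selector wrappers: carry an integer as a type.
template <int V>
struct OpSelA { static constexpr int value = V; };
template <int V>
struct OpSelB { static constexpr int value = V; };

// Hypothetical helpers: find a selector in a type pack, defaulting to 0.
template <typename... Params>
struct find_opsel_a { static constexpr int value = 0; };
template <int V, typename... Rest>
struct find_opsel_a<OpSelA<V>, Rest...> { static constexpr int value = V; };
template <typename First, typename... Rest>
struct find_opsel_a<First, Rest...> : find_opsel_a<Rest...> {};

template <typename... Params>
struct find_opsel_b { static constexpr int value = 0; };
template <int V, typename... Rest>
struct find_opsel_b<OpSelB<V>, Rest...> { static constexpr int value = V; };
template <typename First, typename... Rest>
struct find_opsel_b<First, Rest...> : find_opsel_b<Rest...> {};

struct warp_gemm_sketch
{
    // Variadic type parameters: `operator()<1, 2>()` would not compile here,
    // because 1 and 2 are values, not types.
    template <typename... Params>
    constexpr int operator()() const
    {
        return 10 * find_opsel_a<Params...>::value + find_opsel_b<Params...>::value;
    }
};
```

A call site then passes `OpSelA<1>, OpSelB<2>` as template arguments where it previously passed `1, 2`.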
## Changes
- `Jenkinsfile`: pre-compile the FlyDSL MoE AOT cache (`python3
aiter/aot/flydsl/moe.py`) before the AITER tests.
- `include/ck/utility/amd_buffer_addressing_builtins.hpp` and
`include/ck_tile/core/arch/amd_buffer_addressing_builtins.hpp`: drop the
`__clang_major__` guard and always use
`__asm("llvm.amdgcn.raw.buffer.load.lds")`. The plain name is the
canonical one for all sizes including the gfx950 16-byte form, as the
upstream LLVM gfx950 tests confirm.
-
`include/ck_tile/ops/flatmm/pipeline/mixed_prec_flatmm_pipeline_agmem_bgmem_creg_v1.hpp`:
wrap the warp-GEMM op-selectors in `OpSelA<>`/`OpSelB<>` at the five
call sites, matching the `mx_flatmm` pipeline.
## Test plan
Validated via CI.
[CK] Upgrade to new gfx1250 compiler and fix build issues
(#7960)
## Motivation
The docker image we've been using to build for gfx1250 is a few months
old, so we need to upgrade. Some of the changes in the latest compiler
version require changes in the code. TDM is temporarily disabled due to
changes in the lds load/store intrinsics.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Add conv Wavelet GEMM pipeline and bwd_weight
instances (#7937)
## Motivation
CK Tile had no pipeline competitive with old CK's wavelet on the
RetinaNet K=36 C=256 3x3 conv bwd_weight class. This adds a
wave-specialized "wavelet" GEMM pipeline so CK Tile has a competitive
kernel for spatial small-K shapes.
## Technical Details
- New wavelet GEMM pipeline (`gemm_pipeline_ag_bg_cr_wavelet.hpp`):
workgroup split into math waves (LDS read + MFMA) and load waves (DRAM
read + LDS write).
- VGPR role-split: `operator()` has two top-level mutually-exclusive
`is_math` branches so the allocator overlays both roles onto the same
physical VGPRs, cutting arch VGPR ~33-40% and raising occupancy.
Correctness depends on identical `block_sync_lds` counts on both arms
plus a matching load-wave barrier stub in the epilogue
(`cshuffle_epilogue.hpp`).
- Kernel dispatch (`grouped_convolution_backward_weight_kernel.hpp`):
`kIsWavelet` path, `LaunchBlockSize`, load-wave barrier stub.
Uplift: wavelet is the fastest CK Tile pipeline on the RetinaNet K=36
C=256 3x3 family, beating the best non-wavelet CK Tile kernel by 10-27%
(googlenet K=320 by 16-23%); the role-split roughly halves the parity
gap vs old CK on the 13x13 fp16 shape.
## Test Plan
- `ckProfiler grouped_conv_bwd_weight`, NHWGC layout, fp16/bf16,
`split_k=all`, CPU verify on RetinaNet K=36 shapes (7x7, 13x13) and a
broad 2D sweep.
- Correctness: `-v=1` across `split_k` in {-1,1,2,4,8,16,32,64}
(barrier-parity / deadlock check).
- `test_grouped_convnd_bwd_weight` over the tests `.conf` wavelet
instances.
## Test Result
- All wavelet instances CPU-verify correct across the split-K sweep; no
hangs (dual-arm barrier sequence matches).
- Wavelet wins the RetinaNet K=36 C=256 3x3 family (10-27% over best
non-wavelet CK Tile) and googlenet K=320 (16-23%); at parity-or-better
vs old CK on the majority of spatial shapes.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_Tile] Add scale16 Support for F4 WMMA in CK_Tile
## Motivation
This PR adds CK Tile support for the scale16 F4 WMMA path on gfx1250 and
improves warp GEMM unit test coverage/structure for gfx1250-specific
cases.
## Technical Details
- Scale16 support in warp GEMM dispatch and WMMA trait plumbing: added
IsScale16 plumbing to warp GEMM dispatcher path
- Warp GEMM test restructuring for gfx1250: added Warp GEMM gfx1250
coverage to verify all F4 WMMA paths
## Test Plan
Run ./test_ck_tile_wg_32x16x128_fp4.
## Test Result
```
./test_ck_tile_wg_32x16x128_fp4
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (1751 ms total)
[ PASSED ] 3 tests.
```
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Use gfx11 float buffer atomics in FMHA Bwd
## Motivation
FlashAttention CK backward on gfx11 can hit out-of-bounds/tail writes in
the dQ accumulator atomic-add path when sequence rows are padded at the
tile level but not marked invalid in the DQDKDV main tensor view.
With the generic global atomic fallback, an incorrectly-valid tail
element can issue an actual pointer-based `atomicAdd`. With the buffer
atomic path, the write is issued through a buffer resource with bounds
information and follows the same backend already used by gfx9/gfx12.
This fixes the gfx11 FMHA BWD failure without changing the gfx11 default
for unrelated CK Tile kernels.
## Technical Details
This PR enables the existing CK Tile AMD buffer float atomic-add path
only for generated FMHA BWD gfx11 translation units.
gfx11 normally uses the generic global atomic fallback for
floating-point `buffer_view::atomic_add`. That fallback performs the
atomic through a raw computed pointer and depends on the software
validity predicate to avoid invalid elements. In FMHA BWD dQ
accumulation, padded tail rows can reach this path, so using the buffer
atomic backend is safer: it uses a buffer resource with base pointer,
bounds information, and an element offset, matching the backend already
used by gfx9/gfx12.
Enabling `CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT` globally for gfx11 is
too broad and can break unrelated gfx11 CK builds such as GEMM. Instead,
`config.hpp` now preserves an explicitly pre-defined
`CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT`, while keeping the existing
default disabled for gfx11.
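
The "preserve a pre-defined value" pattern looks roughly like this. `SKETCH_TARGET_GFX11` is a hypothetical stand-in for the real architecture check, and the default of 1 for other targets is an assumption of this sketch rather than a statement about `config.hpp`.

```cpp
// If the build (e.g. a generated FMHA BWD translation unit) already defined the
// macro, leave it alone; otherwise fall back to the per-architecture default.
#ifndef CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT
#if defined(SKETCH_TARGET_GFX11)
#define CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT 0 // gfx11 default stays disabled
#else
#define CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT 1 // assumed default elsewhere
#endif
#endif

constexpr bool use_buffer_atomic_add_float_sketch = (CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT != 0);
```

A translation unit that needs the buffer atomic path on gfx11 would define the macro to 1 on its own command line, leaving every other gfx11 build on the existing default.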
## Test Plan
Validated the change with the FlashAttention CK full test suite with
backward pass enabled on gfx11.
`pytest -q -s tests/test_flash_attn_ck.py`
## Test Result
FlashAttention CK gfx11 test result:
260680 passed, 152076 skipped
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
[CK_TILE][GFX1250] Enable MX GEMM FLATMM with ASYNC
## Motivation
Enables MX GEMM FLATMM pipeline on gfx1250. The pipeline uses an async
load instruction for tensor A, which complements the existing MX GEMM
FLATMM pipeline with TDM load. At this time, only FLATMM MX pipelines
are enabled on gfx1250.
## Technical Details
The existing gfx950 implementation was extended to support gfx1250
architecture. All three MX FP data types are supported across the two
ASICs.
It should be noted that while the TDM pipeline uses an emulated
32x32x128 warp-tile instruction, the present submission relies on the
built-in 16x16x128 instruction, called 4 times per warp.
## Test Plan
Existing `test/ck_tile/flatmm` tests were extended to cover new gfx1250
functionality.
To help facilitate the testing in development,
`example/ck_tile/18_flatmm/script/smoke_test_mx.sh` script was
introduced to verify various combinations of supported data types and
pipeline versions.
## Test Result
The present submission is expected to work on both gfx950 and gfx1250
hardware for all reasonable sizes and all MX FP8/FP6/FP4 data types.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
- [x] Relies on #6978 and should only be merged after the changes are
merged to the `develop`.
[CK] suppress compiler warnings while building pytorch. (#7760)
## Motivation
Compiler flags that were recently added to suppress false warnings from the
latest staging compiler are not recognized by older compiler versions and
trigger an avalanche of warnings. A previous attempt to suppress them with
the -Wno-unknown-warning-option flag didn't help, because that flag wasn't
recognized either and just added more warnings. I've verified that the
current approach, which checks the clang version, works as intended and
makes the warnings go away.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE][FMHA] Improve precision of mxfp4 FMHA with fp6 for matrix P (#5388)
## Motivation
Improve precision of mxfp4 without performance penalties.
## Technical Details
Since performance of scale MFMAs is the same when neither A nor B is
fp8/bf8, it is possible to use fp6 x fp4 instead of fp4 x fp4 for the
second GEMM, while types of Q, K, V stay the same.
This significantly improves overall precision, because fp6 has 32
non-negative values available for P quantization compared to just 8 values
for fp4.
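
The 32 versus 8 figures follow from the bit widths. A small sketch, assuming one sign bit and no encodings reserved for Inf/NaN (`non_negative_encodings_sketch` is a hypothetical name):

```cpp
// Number of non-negative encodings (including +0) of a sign-magnitude format
// with `total_bits` bits, one of which is the sign bit.
constexpr int non_negative_encodings_sketch(int total_bits)
{
    return 1 << (total_bits - 1);
}
```

With 6 bits this gives 32 quantization levels for P, and with 4 bits only 8.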
It was found that there is a compiler bug with
`__builtin_amdgcn_cvt_scalef32_2xpk16_fp6_f32` (described in
LCOMPILER-561) but a workaround seems to fix all failing instances.
## Test Plan
```
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```
## Test Result
The tests must pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK TILE] Unification Work – Add MFMA specialisations for `fp64_t` (#7104)
## Motivation
This PR adds two specialisations related to `fp64_t`.
## Technical Details
This adds two new specialisations for MFMA dense builtins, and adjusts
ABLayout and CLayout to L{K1BM} and L{M1BN}.
## Test Plan
All the new wrappers were added to the test suite in
test_amdgcn_mma_layout.inc.
## Test Result
Tests should pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Adding WMMA wrappers for sparse builtins (#6567)
## Motivation
This PR is part of the [WMMA/MFMA] unification work. It's the third of
the series of PRs (after
https://github.com/ROCm/rocm-libraries/pull/5801 and
https://github.com/ROCm/rocm-libraries/pull/6014) that add all the
necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse
WMMA intrinsics.
## Technical Details
This change adds new specializations for WMMA sparse builtins. In total,
we add 8 WMMA builtins.
## Test Plan
All the new wrappers were added to the test suite in
`test_amdgcn_mma_layout.inc`.
## Test Result
Tests pass locally; waiting for CI.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] upgrade CI to rocm7.13 as default compiler (#7612)
## Motivation
Upgrade the default docker and compiler version in CI to rocm7.13.
In order to pass all the checks, I also had to clean up a lot of
non-ASCII characters in the source code comments and modify a couple of
tests that were affected by new compiler logic.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
[CK Tile] Fix Grouped Gemm quant mixed precision (#7537)
<Migrate from Internal repo PR>
test_ck_tile_grouped_gemm_quant_tensor would fail for mixed FP8/BF8
cases:
std::tuple<Row, Col, Row, FP8, F32, BF8, F32, F32, F16, TensorQuant,
False, True, False>,
std::tuple<Row, Col, Row, BF8, F32, FP8, F32, F32, F16, TensorQuant,
False, True, False>
GFX1250 would fail with incorrect results; GFX950 would fail to compile
for BF8+FP8 and give incorrect results for FP8+BF8.
The issue is caused by incorrect ComputeDataType selection.
The fix is to consider the original ADataType and BDataType even when
ComputeDataType is not void. To fix the compile error on gfx950, a bf8/fp8
16x16x32 warp GEMM is added.
Add asynchronous XOR shuffle support to the Async GEMM pipeline and the MX GEMM pipeline (#7112)
## Motivation
The goal of this work is to apply XOR shuffle (swizzle) to the current
`comp_async` GEMM pipeline and the `gemm_mx` pipeline.
XOR swizzling helps avoid LDS bank conflicts: data is redistributed across
LDS banks so that threads simultaneously accessing different rows land on
different LDS banks.
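
The effect can be illustrated with a toy bank model. The bank count, row stride, and the `col ^ row` mapping below are assumptions chosen for illustration; the swizzle used by the pipelines may differ in granularity and layout.

```cpp
constexpr int kNumBanksSketch       = 32; // hypothetical number of LDS banks
constexpr int kRowStrideWordsSketch = 32; // hypothetical row stride in bank-sized words

// Bank hit by element (row, col) of a row-major tile, with or without XOR swizzle.
constexpr int bank_of_sketch(int row, int col, bool swizzle)
{
    const int physical_col = swizzle ? (col ^ (row % kRowStrideWordsSketch)) : col;
    return (row * kRowStrideWordsSketch + physical_col) % kNumBanksSketch;
}

// Number of distinct banks hit when `rows` threads each read column `col`
// from a different row at the same time.
constexpr int distinct_banks_sketch(int rows, int col, bool swizzle)
{
    unsigned seen = 0; // bitmask of banks already hit
    int count     = 0;
    for(int row = 0; row < rows; ++row)
    {
        const unsigned bit = 1u << bank_of_sketch(row, col, swizzle);
        if((seen & bit) == 0)
        {
            seen |= bit;
            ++count;
        }
    }
    return count;
}
```

In this model, 32 threads reading the same column of 32 different rows all hit one bank without swizzling, and 32 distinct banks with it.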
## Technical Details
A similar approach to the work in the existing eight-waves pipeline was
followed.
Currently, XOR swizzle support is available for FP8 and BF8 types.
FP4 support is also available for MX GEMM.
Should the types not match, or should the async vector width be of an
unsupported size, then the pipeline falls through to the previously
existing ('unswizzled') path.
## Test Plan
Execute `test_ck_tile_gemm_pipeline_comp_async` for the Async GEMM
pipeline.
Execute `test_ck_tile_mx_gemm_fp8` and `test_ck_tile_mx_gemm_fp4` for
the MX GEMM pipeline.
## Test Result
The tests passed successfully in the `Alola` cluster with MI350
hardware.
## Submission Checklist
- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[CK Tile] Adding MFMA wrappers for dense builtins (#6014)
## Motivation
This PR is part of the [WMMA/MFMA] unification work. It's the second of
the series of PRs (after #5801) that add all the necessary MMA builtins
as `amdgcn_mma` structs. This PR focuses on dense MFMA intrinsics.
## Technical Details
This change adds new specializations for MFMA dense builtins. In total,
we add 55 MFMA builtins.
## Test Plan
All the new wrappers were added to the test suite in
`test_amdgcn_mma_layout.inc`.
## Test Result
Tests pass locally; waiting for the CI.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Support multi-vector reads in static encoding patterns (#7528)
## Motivation
The thread-raked / warp-raked / block-raked static tile distribution
patterns in `ck_tile` silently produce wrong results when the contiguous
tile dimension is larger than `warp_size * vector_size`, because the
encoding has no per-thread iteration dimension along X.
Concretely, with `M_Tile=N_Tile=128`, `VectorSize{A,B,C}=1` in
`ConvConfigComputeV3`, the grouped convolution backward-weight example
reports about 50 percent wrong values, with errors starting exactly at
the `X0*X1 = 64` boundary. The second pass over the contiguous dim is
never performed.
This PR extends the encoding so multi-vector reads in the contiguous
tile dimension are supported, while keeping every existing call site
bit-for-bit identical.
## Technical Details
Three files changed.
### 1. `include/ck_tile/core/algorithm/static_encoding_pattern.hpp`
Add a per-thread X iteration dimension in all three raked
specializations:
- `X0 = min(warp_size, XPerTile / X1)` — threads in X dim
- `X1 = min(LargestVec, VecSize)` — vector size per access
- `X2 = XPerTile / (X0 * X1)` — number of X-iters per thread (new)
`X2` is gated with `if constexpr (X2 == 1) { old } else { new }` in both
`make_2d_static_tile_distribution()` and
`make_shuffled_2d_static_tile_distribution()`.
The new encoding places `X2` in the middle of the Ys iteration list,
which preserves reverse symmetry between the regular `<..., X2, X1>` and
shuffled `<X1, X2, ...>` encodings.
Patterns updated: `thread_raked`, `warp_raked`, `block_raked`.
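The X split described above can be checked with a small Python sketch. The function name and the default values for `warp_size` and `largest_vec` are assumptions made for illustration; only the three formulas come from this description.

```python
def raked_x_split(x_per_tile, vec_size, warp_size=64, largest_vec=8):
    """Return (X0, X1, X2) for the contiguous tile dimension.

    warp_size=64 and largest_vec=8 are illustrative defaults, not values
    read from static_encoding_pattern.hpp.
    """
    x1 = min(largest_vec, vec_size)        # vector size per access
    x0 = min(warp_size, x_per_tile // x1)  # threads along X
    x2 = x_per_tile // (x0 * x1)           # X iterations per thread (new)
    return x0, x1, x2


if __name__ == "__main__":
    for x_per_tile, vec in [(64, 1), (128, 2), (128, 1), (256, 1)]:
        x0, x1, x2 = raked_x_split(x_per_tile, vec)
        path = "old" if x2 == 1 else "new"
        print(f"XPerTile={x_per_tile} vec={vec}: X0={x0} X1={x1} X2={x2} ({path})")
```

With a 64-wide warp and vector size 1, a contiguous dimension of 128 or 256 elements yields `X2 = 2` or `X2 = 4`, which is exactly the case the previous encoding could not express.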
### 2. `include/ck_tile/core/tensor/transpose_tile.hpp`
Added a parallel `else if constexpr (... && NDimY == 3 && ...)` branch
alongside the existing `NDimY == 2` branch. The original branch is
byte-for-byte unchanged.
Both branches dispatch to the same `transpose_tile2d_impl_in_thread`,
whose body has always been NDimY-generic (iterates with `static_for<0,
NDimY, 1>` and `number<NDimY>{}`).
### 3. `experimental/grouped_convolution_tile_instances/generate_instances.py`
Removed the two now-obsolete skip guards in `parse_bwd_weight_instances`
and `parse_bwd_data_instances`:
```python
if m_per_block > (warp_size * a_scalar_per_vector) or n_per_block > (warp_size * b_scalar_per_vector):
    print(f"Skipping instance {instance_id} with multiple warps per continous tile dim since it's not supported yet.")
    continue
```
Other unrelated skips (V5 / V6 / ASYNC_V4 pipeline gating,
irregular-load shapes, scalar-per-vector > tile size) are kept
untouched.
### Compatibility
Strict. Every existing caller has `X2 == 1` and therefore hits the
original encoding path verbatim. No upstream config or pipeline behavior
changes.
## Test Plan
The grouped convolution example is the natural exerciser since
`GroupedConvUniversalPipelineAgBgCrPolicy` selects `thread_raked` for
both A and B tiles, and all three conv directions share the same
`ConvConfigComputeV3`.
For each test below we ran:
```
./build/bin/tile_example_grouped_conv_bwd_weight [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_fwd [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_bwd_data [-prec={fp16,bf16}]
```
with `ConvConfigComputeV3` tile/vector parameters tweaked to cover both
code paths:
| Test | M / N / K   | VecA/B/C | A path     | B path     | dtype               |
|------|-------------|----------|------------|------------|---------------------|
| T1   | 16/64/32    | 4/8/4    | old (X2=1) | old (X2=1) | fp16                |
| T2   | 128/128/64  | 2/2/2    | old (X2=1) | old (X2=1) | fp16                |
| T3   | 256/256/64  | 1/1/1    | old (X2=1) | new (X2=4) | fp16                |
| T5   | 256/256/64  | 1/1/1    | old (X2=1) | new (X2=4) | fp16 (3 dir)        |
| T4b  | 128/128/128 | 1/1/1    | new (X2=2) | new (X2=2) | fp16 + bf16 (3 dir) |
A larger T4a (256/256/128) was attempted to stress both A and B with
X2>1 on bigger tiles but was blocked by the gfx942 hardware LDS cap (128
KB > 64 KB limit), independent of this PR.
For the generator change we ran:
```
python3 generate_instances.py --mode profiler --direction all
```
and verified `Skipping instance ... with multiple warps per continous
tile dim` no longer appears (count went from non-zero to 0); other skip
categories are unchanged.
`clang-format-18` was applied to both modified `.hpp` files (matches the
repo's `.clang-format`).
## Test Result
- T1 and T2 (compat-strict, every X2 is 1, old code path): `correct`.
Confirms existing callers are unaffected.
- T3 (X2=4 on B only): `correct`. First true exercise of the new NDimY=3
encoding + transpose branch.
- T5 (T3 across `fwd` + `bwd_data` + `bwd_weight`, fp16): all 3
`correct`.
- T4b (X2>1 on both A and B, fp16 + bf16, all 3 directions): all 6 runs
`correct`.
- Generator: 0 `multiple warps per continous tile dim` skips remaining;
other skips unchanged.
Sample run output (T4b, bf16, bwd_data):
```
shape: tile_gemm_shape_128x128x128x4_1x4x1_16x16x32
pipeline: pipeline_AgBgCrCompV3_128x128x128_256_1x1x1_1x4_1x1x1_..._DoubleSmemBuffer_0
Vector size A: 1, Vector size B: 1, Vector size C: 1
0.934907 ms, 8.34683 TFlops, 34.3178 GB/s
Relative error threshold: 0.00390625 Absolute error threshold: 0.25
The CPU verification result is: correct
```
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
[CK Tile][MFMA/WMMA unification] Add support for packed datatypes (tiny types) (#6088)
## Motivation
This MR makes all the changes required for the unified architecture to
be able to deal with packed datatypes i.e. int4, fp4, fp6, and bf6. The
crux is that layout parameters should be interpreted as describing the
pure mathematical matrix fragments, while the ext_vectors and tile
distribution encodings describe everything in terms of packed datatype
units. This matches how packed types are dealt with in ck_tile and
should play nicely with the load and store tile ops once we integrate
the unified framework into CK tile.
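As a rough illustration of the distinction between mathematical elements and packed units, the Python sketch below converts a logical element count into packed units. It assumes that the `xN` suffix in a type name means N elements per packed unit; the table and helper names are hypothetical, and real storage may include padding that this sketch ignores.

```python
# Hypothetical table: type name -> (bits per element, elements per packed unit).
# The "xN = N elements per unit" reading of the names is an assumption.
PACKED_TYPES = {
    "pk_fp6x16_t": (6, 16),
    "pk_bf6x16_t": (6, 16),
    "pk_bf6x32_t": (6, 32),
}


def packed_units(logical_elems, type_name):
    """Number of packed units needed to hold `logical_elems` matrix elements."""
    _, per_unit = PACKED_TYPES[type_name]
    if logical_elems % per_unit != 0:
        raise ValueError("logical element count must be a multiple of the pack size")
    return logical_elems // per_unit


def payload_bytes(type_name):
    """Payload size of one packed unit in bytes (ignores any storage padding)."""
    bits, per_unit = PACKED_TYPES[type_name]
    return bits * per_unit // 8


if __name__ == "__main__":
    # A layout parameter of K = 128 describes 128 mathematical elements,
    # while the vector type sees 128 / 16 = 8 packed units.
    print(packed_units(128, "pk_fp6x16_t"), payload_bytes("pk_fp6x16_t"))
```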
The bf6 datatype was added to CK tile in the form of pk_bf6x16_t and
pk_bf6x32_t, which did not exist before.
The ext_vector implementations of pk_fp6x16_t and pk_bf6x16_t (vec size
1 and 2) were extended to make the subscripting operator work as
expected.
The layout test was adapted to be compatible with all packed datatypes,
and all new intrinsics were added to the test.
This MR adds ALL intrinsics across ALL architectures which use packed
datatypes, as well as ALL scale intrinsics:
- mfma_scale_f32_16x16x128_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6,
BF6xBF6)
- mfma_scale_f32_32x32x64_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6,
BF6xBF6)
- wmma_i32_16x16x16_iu4_w32
- wmma_i32_16x16x16_iu4_w32_gfx12
- wmma_i32_16x16x32_iu4_w32_gfx12
## Testing
All intrinsics were tested on all architectures.
[CK Tile] Eight Waves pipeline for MX GEMM (#5552)
## Motivation
Integrate Eight Waves pipeline in MX GEMM
## Technical Details
- EightWaves pipeline:
  - Add pipeline, policy and block gemm (internally using existing
    implementation used by GEMM and ABQuant)
  - Extend support of EightWaves policy for FP4 (packed types)
- Async pipeline:
  - Fix pipeline with packed scales (requires MRepeat and NRepeat to be
    contiguous)
  - block gemm specific for MX GEMM is defined because distribution
    encodings have changed
- CShuffle:
  - Add new functionality to support MRepeat and NRepeat contiguous
    (defined by `TilesPacked`)
- Examples:
  - Refactor examples to easily switch different configurations (similar
    to GEMM universal)
  - Scales values generated consistently with other microscale
    implementations in CK Tile
  - Add configuration for EightWaves pipeline
- Tests:
  - Unify existing FP8 and FP4 tests
  - Add tests for EightWaves pipeline
  - Scales values generated consistently with other microscale
    implementations in CK Tile
Note: FP6 support for MX GEMM was added later; Eight Waves pipeline
support for FP6 will follow in a subsequent PR.
## Test Plan
Add new pipeline to tests: `test_ck_tile_mx_gemm_async` for both FP4 and
FP8
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Fix for #6207 (#7543)
## Motivation
PR #6207 introduced an error; this PR fixes it.
## Technical Details
Adds a path for GFX1250 in `to_string`
## Test Plan
The test is already included.
## Test Result
Test should pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK TILE] Unification Work – Add `print()` Utility to `MmaOpTraits` (#6207)
## Motivation
It would be useful to have a `print()` utility inside of unification
work's code scope, so that we can print all template params and derived
params of `amdgcn_mma` for easier debugging.
## Technical Details
Adding helper functions and struct to traits, adding `print_flags()` for
each `Default*CtrlFlags`, `amdgcn_target` and `MmaOpTraits` structs, and
adding `print()` for `amdgcn_mma`.
Note: the first commit is **not** in the scope of this PR. This PR
should be merged after https://github.com/ROCm/rocm-libraries/pull/5801
and https://github.com/ROCm/rocm-libraries/pull/5857.
## Test Plan
Adding test in layout test.
## Test Result
Test should pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] add composable kernel support on gfx1250 (#6978)
## Motivation
Add composable kernel support on gfx1250.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Qun Lin <qlin@amd.com>
Co-authored-by: jialuo12_amdeng <jia.luo@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
[CK] Suppress new staging compiler errors (#7384)
## Motivation
This should make new builds with staging compiler pass.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Fix latest batch of staging compiler warnings (#7111)
## Motivation
Suppress the new batch of clang lifetimebound and invalidation warnings
with the latest staging compiler.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Adding WMMA wrappers for dense builtins (#5801)
## Motivation
This PR is part of the [WMMA/MFMA] unification work. It's the first of
the series of PRs that add all the necessary MMA builtins as
`amdgcn_mma` structs.
## Technical Details
This change adds new specializations for WMMA dense builtins. In total,
we now have 9 RDNA4 builtins and 3 RDNA3 builtins.
## Test Plan
All the new wrappers were added to the test suite in
`test_amdgcn_mma_layout.inc`.
## Test Result
Tests pass locally; waiting for the CI.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Yung-sheng Tu <yung-sheng@streamhpc.com>
[CK_TILE] fix(fmha): support >2GB KV cache in batch prefill via template dispatch (#6653)
## Motivation
The CK batch prefill kernel previously failed (silent overflow + page
faults) when the KV cache exceeded 2 GB, blocking long-context inference
workloads (e.g., 128K+ token contexts with paged KV).
Two distinct failure modes were addressed:
1. **>4GB SRD overflow (`page_size < kN0`):** The SRD
`buffer_load_dwordx4` path uses a 32-bit `voffset` register; for small
page sizes the rebased SRD spans the full KV pool and the offset wraps
past 2 GB, corrupting K/V loads.
2. **gfx950 page-table fault (`page_size >= kN0`):** On CDNA4 the
hardware validates the **full SRD `num_records` range** against
page-table permissions (CDNA3 only checks per-instruction `voffset`).
After per-tile SRD rebase, an un-trimmed `num_records` field extends
past the live page and faults on freed/protected memory.
## Technical Details
**Two-mode `tile_scatter_gather` selected by the `kUseGlobalLoad`
template parameter:**
| Case | `page_size` | KV cache size | Mode | Load path | Addressing |
|---|---|---|---|---|---|
| 1 | `>= kN0` (large pages) | any | SRD (`kUseGlobalLoad=false`) | `buffer_load_dwordx4` | 32-bit `voffset`, bounded by per-page rebase |
| 2 | `< kN0` (small pages) | `<= 2 GB` | SRD (`kUseGlobalLoad=false`) | `buffer_load_dwordx4` | 32-bit `voffset`, fits in INT32 byte range |
| 3 | `< kN0` (small pages) | `> 2 GB` | Global-load (`kUseGlobalLoad=true`) | `async_load_tile_raw_flat` (K) + `load_tile_flat` (V) | 64-bit |
**Dispatch:** the auto-gen API layer (`fmha_batch_prefill.py`) selects
the kernel instantiation at launch from `(page_block_size,
num_total_pages * batch_stride_k * kElementBytes)`, so the small-page
penalty is paid only when correctness requires it.
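The selection rule in the table can be sketched in Python as follows. This is an illustrative model only: the function name, the `k_n0` argument and the use of `INT32_MAX` as the exact byte threshold are assumptions, not the code generated by `fmha_batch_prefill.py`.

```python
INT32_MAX = 2**31 - 1  # assumed byte limit for the 32-bit voffset path


def use_global_load(page_block_size, kv_cache_bytes, k_n0):
    """Return True when the global-load kernel variant would be needed.

    kv_cache_bytes stands for
    num_total_pages * batch_stride_k * kElementBytes from the description.
    """
    small_pages = page_block_size < k_n0
    exceeds_int32 = kv_cache_bytes > INT32_MAX
    return small_pages and exceeds_int32


if __name__ == "__main__":
    gib = 1 << 30
    k_n0 = 128  # hypothetical tile size, for illustration only
    print(use_global_load(1024, 5 * gib, k_n0))  # case 1: large pages
    print(use_global_load(16, 1 * gib, k_n0))    # case 2: small pages, small pool
    print(use_global_load(16, 5 * gib, k_n0))    # case 3: small pages, large pool
```

Only case 3 selects the global-load path, so large-page and small-pool workloads keep the SRD path.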
**gfx950 SRD `num_records` trimming:** in the K and V rebase lambdas of
`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`,
`set_bottom_tensor_view_buffer_size(page_stride_k/v)` is called after
each rebase to constrain `num_records` to the live page. Required for
CDNA4 page-table validation; harmless on CDNA3.
**Pipeline sync for the global-load path:**
- V uses synchronous `load_tile_flat`; K uses
`async_load_tile_raw_flat`.
- `v_physical_pages_current` is double-buffered so the V flat load
doesn't race against the next iteration's K rebase computation.
**Arch guards:** `global_load_lds` intrinsics are gated to `__gfx94__` /
`__gfx950__` (CDNA3+). Other architectures hit a `dependent_false`
static_assert with a descriptive message.
**Device-side assertion convention:** SRD setters use
`__builtin_assume(cond)` (hint-only) rather than `<cassert>`'s
`assert()`. The latter introduces an `__assert_fail` call whose register
pressure scatters the K-SRD scalar register window across conditional
branches, corrupting `buffer_load_dwordx4` on gfx950.
## Test Plan
Tested on both MI308 (gfx942) and MI355 (gfx950) via the aiter wrapper
test suite. All coverage lives in **`op_tests/test_batch_prefill.py`**:
- **Functional matrix (96 cases)** — `test_batch_prefill`: `page_size ∈
{1, 16, 1024}` × `kv_layout ∈ {linear, vectorized}` × `dtype ∈ {bf16,
fp8 quant variants}` × `causal` × `soft_cap` × `LSE` × `batch_size ∈ {1,
4}` (parametrized to exercise per-sequence SRD rebase across batch
boundaries).
- **>2 GB coverage** — `test_batch_prefill_large_kvcache`: extended to
allocate a 5 GB+ KV cache pool and exercise both `kUseGlobalLoad=true`
(small-page) and `kUseGlobalLoad=false` (large-page rebase) paths.
Includes both single-batch and multi-batch (`batch_size=4`) cases to
exercise per-sequence SRD rebase across the >2 GB pool.
- Numerical reference: PyTorch SDPA, per-batch loop with `atol` / `rtol`
from the existing batch prefill test harness.
## Test Result
| Arch | `test_batch_prefill` | `test_batch_prefill_large_kvcache` (>2 GB) |
|------|----------------------|---------------------|
| MI308 (gfx942) | All passed | Passed |
| MI355 (gfx950) | All passed | Passed |
**Performance impact (gfx950, hot SRD path):**
- +2.67% kernel-time on `seqlen=1024 / page_sz=1024 / bf16 / sglang /
causal / soft_cap=30`, attributable in full to the two
`set_bottom_tensor_view_buffer_size` calls in the K/V rebase lambdas
(5-run median, signal/noise ≈ 9×).
- This cost is **mandatory for gfx950 correctness** on >2 GB workloads —
removing the setters re-introduces page-faults.
- gfx942: 0 regressions in the same range (all configs ≤ +0.97%).
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550)
## Motivation
New changes from upstream llvm-project cause an avalanche of warnings in
CK. This PR disables them by ignoring the
lifetime-safety-intra-tu-suggestions flag until a better permanent
solution is found.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Enable canonical-NaN BF16 conversion for FMHA on RDNA (#6253)
## Motivation
- On gfx11/gfx12, the existing float -> bf16 conversion path in FMHA
forward adds noticeable overhead and causes a meaningful performance gap
versus fp16. The asm-based path (mode 3) does not improve this on RDNA
and can perform even worse.
- In particular, on gfx12, bf16 FMHA forward can be up to ~20% slower
than the corresponding fp16 path.
- This PR reduces that gap by switching FMHA forward to a different BF16
conversion strategy based on Triton’s canonical-NaN
round-to-nearest-even behavior.
## Technical Details
- Add a new `standard_cnan` BF16 conversion mode to CK Tile.
- Implement a canonical-NaN RTN `float -> bf16` conversion path based on
the Triton implementation.
- Enable this conversion mode by default for FMHA forward builds
targeting gfx11/gfx12.
- Retune gfx11/gfx12 FMHA forward kernel selection thresholds for some
`hdim=128` cases to keep kernel selection aligned with the updated
conversion behavior.
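For readers unfamiliar with the technique, a canonical-NaN round-to-nearest-even `float -> bf16` conversion can be sketched in Python as below. This is a bit-level illustration, not the CK Tile `standard_cnan` code: the function name is hypothetical and the canonical NaN pattern `0x7FFF` is an assumption.

```python
import struct

CANONICAL_NAN_BF16 = 0x7FFF  # assumed canonical NaN bit pattern


def float_to_bf16_rne_cnan(value):
    """Convert a Python float (via fp32) to bf16 bits, RNE, canonical NaN."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Any NaN input collapses to one fixed bit pattern (sign and payload dropped).
    if (bits & 0x7FFFFFFF) > 0x7F800000:
        return CANONICAL_NAN_BF16
    # Round to nearest even: add 0x7FFF plus the lowest bit that is kept.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


if __name__ == "__main__":
    for v in (1.0, 1.0 + 2**-8, float("inf"), float("nan")):
        print(v, hex(float_to_bf16_rne_cnan(v)))
```

Collapsing every NaN to a single pattern removes the payload-preserving branch, which is the kind of simplification this conversion mode relies on.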
## Test Plan
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={hdim} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
## Test Result
- all tests passed when running `test_ck_tile_fmha`
- BF16 FMHA forward performance improves by up to ~5% on gfx11.
- BF16 FMHA forward performance improves by up to ~10% on gfx12.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK TILE] Unification of Scale MFMA/WMMA Policy Structs (#5857)
## Motivation
The existing unification work supports DENSE and SPARSE intrinsics. In
this PR, we enable support for SCALE intrinsics and add example SCALE
implementations.
## Technical Details
Adding MFMA SCALE intrinsics support, adding tests for MFMA SCALE
intrinsics, and adding WMMA SCALE policy trait.
Note: fp6 SCALE intrinsics support is not included in this PR, as its
handling in ck_tile is currently more specialized and does not follow
the same pattern as other datatypes.
## Test Plan
Added new tests for the relevant SCALE specialisations.
## Test Result
Test should pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Skip fp16 dropout d256 batch tests for compiler VGPR aliasing bug (#6342)
## Summary
- Skip fp16 FMHA forward dropout tests that use the d256 tile in batch
mode, gated on compiler version
- The AMDGPU compiler miscompiles these kernels due to VGPR aliasing of
Philox RNG parameters under high register pressure (383 VGPRs)
- bf16 dropout tests are unaffected and cover the same code paths
## Root Cause
The compiler aliases `ph_seed` and `ph_head_offset` (Philox RNG state
stored in VGPRs) with other live data during the softmax main loop. This
causes corrupted `buffer_store_byte` writes for dropout randval on wave
lanes 32-63, producing NaN in output and LSE tensors.
**Conditions:** fp16 + d256 tile + dropout + batch mode + `qr` pipeline
+ gfx90a
## Changes
- `include/ck_tile/core/config.hpp`: Add
`CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE` macro
- `test/ck_tile/fmha/test_fmha_fwd.cpp`: Version-gated `GTEST_SKIP` in
`TEST_P(Dropout, ...)`
## Test plan
- [x] ROCm 7.1.1 (clang 20): 168/168 fp16 dropout tests PASS (no skip
active)
- [x] ROCm 7.12 (clang 22): 132 PASS, 36 SKIPPED, 0 FAILED
- [x] bf16 dropout tests: 168/168 PASS (unaffected by this change)
[CK Tile] Unification work - mma transformations pipeline (#5508)
## Motivation
In this PR we showcase how the amdgcn structs could be used in a pipeline that does some extra pre/post processing.
For the sparse intrinsics, we have so far compressed the A vector "on the fly" right before executing the builtin. This might introduce performance issues down the line if, for example, the user decides to chain multiple sparse builtins. We tackle this problem by creating a dedicated SparseCompressTransform.
A MmaPipelineBase is also created to facilitate these kinds of higher-level compositions of the amdgcn structs, and it is integrated into the existing WaveWiseMma prototype. The MmaPipelineOptionFlags are intended to support future operations such as swizzle A/B, C transpose, or double/quad attr num access, but those are not yet defined and should be added in a future PR.
The pipeline base class is basically at the RFC stage.
We also create a runtime test for the existing WaveWiseMma, as well as one for the SparseMma pipeline.
## Technical Details
The goal is to keep the pipeline easily expandable. Should the CRTP of the base class, or the interface in general, prove insufficient for our needs, a design modification should be discussed.
## Test Plan
New tests are added.
## Test Result
Tests should pass.
---------
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
[MIOPEN] [CK] Revert "[CK] Disable test cases affected by compiler codegen bugs on gfx90a" (#6400)
Reverts ROCm/rocm-libraries#6343
This is causing failures in MIOpen, namely Dbsync gfx942, even though it shouldn't be affected, so this needs to be investigated. Please add miopen as a label to the new PR for addressing the compiler codegen bug so that this can be addressed simultaneously.
[CK] Disable compilation of problematic bwd weight conv instances for gfx90a (#6343)
## Motivation
Due to a compiler version update, there are test failures in the test
suite `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current `develop` branch, the test
failures are blocking any PR merges into `develop`. An example of a
failed CI run is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as tests are passing for clang compiler version 20.0 and failing
for clang compiler version 22.0.
## Technical Details
This PR disables the compilation of the problematic bwd weight conv
instance for `gfx90a` by adding a new CMake flag `CK_USE_GFX90A` that
allows us to detect when we are compiling for `gfx90a`. Using the new
CMake flag, compilation of instance
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>` is disabled for `gfx90a`.
Co-authored-by: Ville Pietilä <>
[CK Tile] Add Tile Distribution Encoding Calculator (#5515)
## Motivation
We want to be able to calculate TileDistributionEncodings describing
register mappings for any MmaOp. This is necessary for further
integration with CK Tile.
This MR adds a new struct TileDistrEncCalc, which takes an amdgcn_mma
type (MmaOp) and provides ABC warp distribution encodings for mapping
matrix fragment coordinates to register coordinates (lane, vector item)
and vice versa. It accepts CTranspose, Swizzle, and NumAccessA /
NumAccessB template parameters for tweaking the tile distributions.
Swizzle modification will be implemented later.
The current implementation can deal with all intrinsic types and
block-hiding.
This MR also adds some additional static asserts and derived params
within amdgcn_mma_base, to enforce consistency and help calculate Tile
Distributions for block-hiding intrinsics.
An example was added that uses TileDistrEncCalc to calculate and print
register layouts for the tile distributions of some of our amdgcn_mma
structs. It also checks that the CTranspose modifier works as intended.
Some additional gfx9 intrinsics were added to test block-hiding layouts
for the different types of C-block-hiding layouts.
The sparse intrinsic wrappers were updated according to Chris's recent
changes in another branch
(https://github.com/ROCm/rocm-libraries/pull/5508), which moved the
compression step outside of the intrinsic itself. This is necessary to
make sure that the Calculator can deal with this new interpretation of
the sparse intrinsics. I directly copied the new amdgcn structs from
Chris's branch and changed nothing else to avoid more complex merges in
the future. Note that this means I did not update a bunch of related
sparse code since that would be a lot, and therefore I disabled
test_amdgcn_sparse_mma for now.
The amdgcn_mma_layout test was refactored a bit:
- The old register mapping utility was removed and its use was replaced
by the new TileDistrEncCalc
- More tests were added to test layouts for different types of
block-hiding and sparse intrinsics
- The Selector method was removed and the tests were split up over
target architectures, with each target arch having a direct list of
amdgcn structs to be tested. This ensures that we force specific tests
on specific architectures and makes sure that the selector doesn't
quietly do some workarounds like creating compound intrinsics.
## Test Results
Layout tests based on calculated tile distribution encodings pass on all
architectures. Calculator works for all currently added amdgcn structs,
which includes different types of block-hiding and sparse intrinsics.
Printed layouts from new example verified by eye. CTranspose modifier
tested for large set of intrinsics.
Add missing gfx1033 to gfx103 group definition in ck (#5141)
## Motivation
Resolving PyTorch build failures when enabling builds for gfx103X-all
family in TheRock. https://github.com/ROCm/TheRock/pull/3763. `gfx1033`
is the only failing architecture in the family and the failures point to
missing support in CK.
## Technical Details
PyTorch build fails with repeated error message
```
/__w/TheRock/TheRock/external-builds/pytorch/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:33:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
33 | wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
`gfx1033` is missing from the `__gfx103__` group, which results in
`CK_BUFFER_RESOURCE_3RD_DWORD` never being defined for it. This PR adds
`gfx1033` to the files where it was missing, which should be the minimum
fix to allow torch builds to pass.
## Test Plan
Compile sample test file and target gfx1033
```
...
#ifdef __HIP_DEVICE_COMPILE__
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == 0x31014000, "wrong device value");
#else
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == -1, "wrong host value");
#endif
```
## Test Result
Prior to applying the patch, compilation fails with `error: use of
undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'`.
After applying the patch, the test file compiles successfully.
## Submission Checklist
- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[CK_TILE] Add pooling in tile_engine (#4469)
## Motivation
Add pooling in ck tile engine
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
[CK][CK Tile] Force padding for atomic_add bf16 C tensor (#5842)
## Motivation
Force padding for atomic_add bf16 C tensor to avoid memfaults.
## Technical Details
- add global atomic add for bf16 and enable it
- add padding for atomic add bf16 due to the lack of OOB (out-of-bounds)
checking
- remove padding for non-contiguous dims in conv for other cases
- minor bwd data conv fixes
## Test Plan
test_grouped_conv_*_tile
## Test Result
pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] fix clang lifetimebound errors with staging compiler (#5921)
## Motivation
The ROCm staging compiler (newer Clang) enforces
`[[clang::lifetimebound]]` annotations on methods that return references
or pointers to internal object data. Without these annotations, the
staging compiler emits compilation errors for container accessor methods
across the CK and CK Tile namespaces.
## Technical Details
Adds `[[clang::lifetimebound]]` to all reference/pointer-returning
accessors in core container types:
**`ck::` namespace:**
- `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()`
- `index_array` -- `operator[]`
- `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()`
- `IndexLookupTable` -- `operator[]`
**`ck_tile::` namespace:**
- `array` -- `get(i)`, `at()`, `operator[]`, `operator()`
- `static_array` -- `operator[]`
- `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()`
- `make_kernel()` -- parameter pack
Also removes the unused `instance_index` variable from
`batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing
accordingly.
## Test Plan
- Compile with the staging compiler to verify all lifetimebound errors
are resolved
- Existing tests pass unchanged -- the attribute is a compile-time
annotation with no runtime effect
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.