[CK] increase time limit for fmha_bwd tests to prevent
timeouts (#8241)
## Motivation
Observed a CI failure due to fmha_bwd test timeout which never happened
before. Going to increase the time limit for the test to prevent any
further CI failures.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_Tile] Add wmma_bf16f32_16x16x32_bf16 warp-gemm test
(#8035)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
Adds the warp-gemm unit test for `wmma_bf16f32_16x16x32_bf16`. **Stacked
on #8028** (the API change) and based on its branch, so #8028 shows the
isolated API diff and this PR shows just the test.
## Test
gfx125-guarded `WmmaBf16f32.ResidualPrecisionContrast`: computes `Y_bf16
= X_bf16·W_bf16 + R_fp32` via `WarpGemm::mac_downconvert`, compares
against an fp32 reference (within bf16 tolerance), and asserts it is at
least as accurate as the bf16-accumulate path — i.e. it demonstrates the
precision benefit of the fp32 accumulator (`C`) carried into the fused
bf16 down-convert.
Passes on gfx1250.
[CK Tile] Stream-K RDNA Support
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Currently, CK Tile Stream-K only supports CDNA architectures. This
change adds Stream-K support on RDNA3/3.5 and RDNA4 architectures.
## Technical Details
Stream-K currently has 3 reduction strategies: 1) atomics, 2) linear,
and 3) tree. The linear and tree reductions require inter-workgroup
communication to a global flags buffer and a global partials buffer. To
ensure cache coherency, we use cache modifiers to skip cache levels that
are not visible to all workgroups. On CDNA architectures, scalar load
and scalar store instructions are available, which we use to read and
write to the flags buffer with appropriate cache skipping modifiers.
However, RDNA architectures do not support scalar store instructions, so
workgroups must use a buffer store instruction to write to flags.
Additionally, cache modifiers differ between CDNA and RDNA; they also
differ between RDNA3 and RDNA4. Given this information, the main changes
are as follows:
- Added RDNA flag signaling: Use buffer store instructions for writing
to global flags buffer
- Add appropriate cache modifiers for reading and writing to flags and
partials:
- RDNA3 (gfx11): Use `glc | dlc` coherence flags
- RDNA4 (gfx12): Use `DEVICE` coherence scope
- SFINAE-guarded overloads: Added compile-time dispatch for
`SignalStorePartialDone()` and `WaitStorePartialDone()` based on target
architecture
- RDNA alignment requirements: Increased flags buffer alignment from
128B to 256B due to RDNA cache line size
**A note about the `amd_buffer_coherence_enum`:**
- **Problem:** The `amd_buffer_coherence_enum` uses preprocessor
conditionals (`#if defined(__gfx12__)`) to define architecture-specific
values. Template specializations reference enum values from different
architectures (e.g., `glc_dlc` for GFX11). Due to C++ two-phase name
lookup, non-dependent names are resolved during template parsing
regardless of which architecture is being compiled, causing compilation
failures when referenced values do not exist in the active preprocessor
branch.
- **Temporary Solution**: Added compatibility enum values to each
architecture block. For example, I added `glc_dlc` in the `__gfx12__`
block. I will create a ticket to refactor this enum with a design that
has better scalability and tries to avoid the use of preprocessor
conditionals.
## Test Plan
### Summary
gtests were added to test wmma variants of Stream-K. These tests were
stressed tested locally on gfx11 and gfx12.
### More details
This PR makes the following changes/additions to the Stream-K gtests:
- Split tests into MFMA (CDNA) and WMMA (RDNA) variants
- Added 16 WMMA kernel types: FP16/BF16/FP8/BF8 × Linear/Tree reduction
- WMMA uses 16×16×16 wave tiles for RDNA (this is the only tile size
supported on RDNA)
- Fixed RDNA WGP mode: multiply multiProcessorCount by 2 for actual CU
count
- As described in [HIP
documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/doxygen/html/group___global_defs.html#ggacc0acd7b9bda126c6bb3dfd6e2796d7ca3ac50041beb59111a5c76edf03da0898),
when in Workgroup Processor (WGP) mode, the value of
`hipDeviceAttributeMultiprocessorCount` is half of CUs, because a single
WGP contains two CUs. The default mode on RDNA is WGP mode, so when
creating (M, N, K) instances for gtests using the CU count, we need to
multiply the CU count by 2 to get the correct value. This is not needed
in the kernel host code, because the occupancy ensures that overall
`max_active_wgs` is correct.
## Test Result
All tests pass locally.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Extend type support EightWave pipeline
## Motivation
EightWave pipeline was designed for 8 bit types. This PR extend support
for any FP type
## Technical Details
- Generalize policy to support any FP type
- Change LDS layout to fix bank conflicts. This removes all bank
conflicts in the pipeline (checked for all supported types). Remaining
bank conflicts are related to Cshuffle epilogue.
## Test Plan
Added GEMM tests with new supported types. Note that FP6 is also
supported for MX GEMM but the PR was reverted so no tests were added for
it.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
=?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?=
=?UTF-8?q?Add=20MFMA=20specialisations=20for=20`tf32=5Ft`=20(#6768)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
This PR adds two specialisations related to `tf32_t`.
## Technical Details
This change treats `tf32_t` as a concrete type rather than an empty
`struct`. It also adds two new specialisations for MFMA dense builtins
and resolves existing circular include issues.
## Test Plan
All the new wrappers were added to the test suite in
test_amdgcn_mma_layout.inc.
## Test Result
Test should pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] [QuantGEMM] Fix SplitK tail handling and other
improvements (#7199)
This pull request introduces improved and more robust split-K support
for quantized GEMM. The main changes add runtime validation, utility
functions for split-K batch calculations, pointer offset handling for
split-K in grouped kernels, and enhanced support for various tensor
layouts. The changes also improve error handling and provide more
flexibility for runtime tail handling in split-K pipelines.
**Split-K Support and Validation Enhancements:**
* Added runtime validation to ensure `k_batch` is a positive integer and
that split-K configurations do not produce empty final batches or
mismatched pipeline tails, with detailed error messages and logging for
misconfiguration.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1184-R1211)
[[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1161-R1250)
* Introduced utility functions `get_splitk_batch_k_read` and
`get_splitk_last_batch_k` to compute per-batch K read sizes and handle
split rounding, ensuring correct and consistent split-K batch
partitioning.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R206-R234)
[[2]](diffhunk://#diff-635b89bdffa96b2b42f1632520cde36701d7d631e864185591f6b32f7645cf47L104-R107)
[[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L388-R417)
[[4]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1161-R1250)
* Changed the default value of `k_batch` in `QuantGemmHostArgs` to 1 (no
split-K) for safer default behavior.
**Pointer Offsets and Grouped Kernel Handling:**
* Updated `QuantGroupedGemmKernel` to apply split-K per-batch offsets to
all input pointers, mirroring the behavior of non-grouped kernels and
ensuring correctness for split-K launches.
* Modified AQ tensor view handling to correctly reflect the remaining
K-groups from the split-K batch's offset position, improving accuracy
for split-K in grouped kernels.
**Pipeline and Layout Flexibility:**
* Added support for runtime selection of split-K tail handling via a new
template parameter `RuntimeSplitKTail_`, with new helper methods to
dispatch GEMM pipelines accordingly.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R273)
[[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1496-R1567)
[[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1427)
[[4]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1447-R1629)
[[5]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1459-R1641)
* Improved handling for tensor layout cases, including preshuffled B and
both row-major and column-major AQ layouts, ensuring correct pointer
arithmetic and compatibility checks.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R438-R454)
[[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L464-R516)
[[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1184-R1211)
[CK Tile] Async support preshuffle GEMM
## Motivation
Add async support to existing preshuffle GEMM pipeline
## Technical Details
Notes:
the implementation avoids previous strategy of duplicating pipelines for
async support and instead add a switch `Async` to the ops Problem to
enable async pipeline. Then, integrate the async pipeline in the
existing one. This allows to avoid code duplication and facilitate the
integration of buffer load to lds in existing pipelines. In my opinion,
it should be used also for other pipelines which don't support buffer
load to lds yet and it would also be a good idea to refactor the
existing async GEMM pipelines with the same approach.
Summary:
- integrate buffer load to lds in existing pipeline
- add optimal tensor descriptors for vmem loading and lds reading. They
are currently optimized for 16x16 wave tiles but they also work for
32x32 wave tiles. Optimizations for 32x32 wave tile requires different
lds layout and it will be done in a follow-up issue
- Add async config to examples
- Add test (gfx950 only)
## Test Plan
New test for gfx950 `test_ck_tile_gemm_pipeline_wp_async`
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] PermuteN support MX GEMM
## Motivation
Add PermuteN support to preshuffle MX GEMM
## Technical Details
- Modify `shuffle_b_permuteN` to support MX preshuffled layout
- Add `preShuffleScalePermuteN` with same functionality of
`preShuffleScale` but layout consistent with PermuteN
- Include MX pre-processing functions in the library
## Test Plan
Add test configuration for permuteN with preshuffle (both FP4 and FP8)
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Cong Ma <congma13@amd.com>
[ck] Enforce ASCII-only C/C++ sources for hipRTC
compatibility (#7829)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
CK source files must be compilable via **hipRTC (HIP runtime
compilation)**, whose preprocessor does not accept non-ASCII bytes
anywhere in a translation unit — **including in comments**. Bytes that
are harmless under `hipcc` (em-dashes, smart quotes, multiplication
signs, Greek letters, box-drawing glyphs, etc.) cause hipRTC to fail at
preprocessing time. These regularly leak in via LLM-assisted authoring
or copy/paste from formatted documents and silently break hipRTC paths
that are not exercised by the default `hipcc`-based build matrix.
This PR (a) cleans every existing violation (53 files) and (b) adds a
pre-checkin gate so new violations are rejected before merge.
## File extensions covered
Both the cleanup scan and the new Jenkins enforcement stage use the same
predicate:
```
*.h *.hpp *.cpp *.h.in *.hpp.in *.cpp.in *.inc *.cl
```
(excluding `*/build/*` and `*/include/rapidjson/*`). This is a strict
superset of the existing `Clang Format` stage's predicate — `*.inc` is
added so test-fixture include files are also gated. The local pre-commit
hook's `c++/inc` type filter covers the same set.
## Why no enforcement today
CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so
the existing `pre-commit` workflow doesn't touch CK. The local CK
`.pre-commit-config.yaml` only runs for developers who installed hooks.
The **authoritative gate is therefore the new Jenkins stage** in this
PR; the local hook is convenience.
## Commit layout (bisect-friendly)
1. `79798aa6261` — **`[ck] Convert reflect/ rendering to ASCII for
hipRTC compatibility`**
Behavior change, isolated. `TreeFormatter` swaps `├─ / └─ / │ ` for `|-
/ +- / | ` (3-col width preserved so alignment is unchanged).
`conv_description.hpp` swaps `×` for `x` as the dimension separator.
`test_conv_description.cpp` expected strings updated in lockstep so the
snapshot test stays green. This is the only commit in the series with
observable runtime impact.
2. `738fdb0d81c` — **`[ck] Strip non-ASCII bytes from C++ sources for
hipRTC compatibility`**
Mechanical text cleanup across 53 files. Replacements happen in comments
or in `std::cout` strings that are not asserted on by any test. None of
the 174 `.inc` files in the tree required edits, but they were in the
scan's predicate so the enforcement stage's predicate is a superset of
what was scanned. Full replacement table in the commit message.
3. `1d7cd8ba235` — **`[ck] Enforce ASCII-only C/C++ sources for hipRTC
compatibility`**
- New `projects/composablekernel/script/check_ascii_only.sh` (modeled on
`check_copyright_year.sh`).
- New entry in `projects/composablekernel/.pre-commit-config.yaml` under
the local-hooks block (`types_or: [c++, inc]`).
- New `ASCII Only Check` parallel stage in
`projects/composablekernel/Jenkinsfile`'s `Static checks` block,
mirroring the existing `Clang Format` stage but with `*.inc` added to
the find predicate. Always-on, no `RUN_CPPCHECK` gate.
The tree is buildable at every commit boundary. Commit 1 leaves 50 known
violations; commit 2 leaves 0; commit 3 wires the gate.
## Demo
Script output on a synthesized violation:
```
$ printf '// em-dash test \xe2\x80\x94 here\n' > /tmp/bad.cpp
$ projects/composablekernel/script/check_ascii_only.sh /tmp/bad.cpp
ERROR: /tmp/bad.cpp contains non-ASCII bytes:
1:// em-dash test — here
Fix: replace with ASCII (em-dash -> --, smart quotes -> ", arrows -> ->, etc.)
$ echo $?
1
```
Full repo scan after the cleanup commits (note the `-name '*.inc'`
clause):
```
$ cd projects/composablekernel && find . -type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \
-o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \
-not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \
| xargs -0 -P 8 -n 64 script/check_ascii_only.sh
$ echo $?
0
```
## Test plan
- [ ] Jenkins PR build: confirm new `Static checks -> ASCII Only Check`
stage runs green over the full predicate (incl. `*.inc`) and existing
`Clang Format` stage is unaffected.
- [ ] `test_conv_description` passes against the ASCII tree-formatter
output (touched in commit 1).
- [ ] Local: `pre-commit run ascii-only-checker --all-files` runs
cleanly after installing CK pre-commit hooks via
`script/install_precommit.sh`.
- [ ] Manually inject a non-ASCII byte in any `.cpp/.hpp/.inc` file,
push: confirm Jenkins fails the new stage with a clear error.
- [ ] Spot-check a representative subset of touched files under hipRTC
compilation to confirm no remaining hipRTC-blocking content (optional,
since the static byte check is a sufficient condition for hipRTC
preprocessor acceptance on this dimension).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
[GFX1250][CK_TILE] Add scale16 warp gemm unit tests
## Summary
- Add scale16 WMMA intrinsic overloads and int64_t forwarding to warp
gemm layers for gfx1250
- Add comprehensive wave-level unit tests for scale16 warp gemm
(16x16x128 and 32x32x128 tile sizes)
- Test all fp8/bf8 type combinations and TransposeC variants
- Fix WarpGemm wrapper for non-uniform scale16 configurations
Stacked on #7724 (FillUniformScaleDistribution / MX GEMM scale init).
Pipeline enablement follows in the next PR.
=?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?=
=?UTF-8?q?More=20accurate=20tests=20for=20MmaPipelines=20(#6212)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
This PR solves several issues:
#### More accurate tests for MmaPipelines
The current tests for the MmaPipelines (test_amdgcn_sparse_mma,
test_amdgcn_wavewise_mma) use explicit input fragment vectors filled
with 1s, and only check the output of a single lane. We should have
tests that actually use the MmaPipelines with non-trivial input matrices
and verify the complete output.
Some other aspects of the current MmaPipelines tests that I noticed and
deserve some attention:
1. There is sometimes iteration over K outside of the pipeline, which is
then included in WaveTileK or FragK, which is not correct. We should
remove it, move K iteration inside of the pipeline, or be more clear
about this outer-K loop size and how it propagates downwards.
2. There is very tight coupling between the kernel, gtest code, and
test_pipeline helper, requiring a lot of information and functions to be
passed back and forth.
3. The test_pipeline helper is doing a bunch of register-related logic
on the host (related to point 1)
4. Without this register logic the only thing it does is check the
device, call the kernel, and check the output, but with a lot of
boilerplate.
#### Test helper for detecting target arch at HOST runtime
There is a really apparent issue we faced while writing tests:
Scenario:
1. Compile a test that supports both gfx950 and gfx1201 for gfx950
2. Run the test on a server that only has gfx1201 GPU
Actual:
Segmentation fault
Expected:
The test can correctly detect from HOST runtime that the DEVICE
target_id was different and skips the test.
Notes:
The only way of detecting the COMPILER_TARGET_ID in the existing "arch"
framework is launching a kernel and calling `get_compiler_target()` (so,
from a DEVICE code). This will create a segmentation fault if the
current arch differs from the target arch. To cope with this issue, we
propose to export the compiler target(s) (note they can be many) through
`projects/composablekernel/test/ck_tile/core/arch/CMakeLists.txt` and
define a test helper to deal with such cases.
#### Add composition support to Transforms
We have a small number of Transforms which act on MmaOp input and output
data, before and after the MmaOp call respectively. These are currently
implemented to work on an MmaTile level, but in theory they are also
supposed to work at a WaveTile level, i.e. after composition of multiple
MmaTiles to create larger effective MNK dimensions. Currently the
composed MmaTiles look like 2D C-style arrays of the individual MmaTile
level register vectors (see WaveWiseMmaPipeline). The transforms should
be able to take these and perform the proper transforms to the whole
WaveTile at once. This might allow for better performing
transformations.
Note: This PR handles the SparseTransform case and if we don't end up
doing scale as a transformation, there isn't really much left to do. If
we end up having only the sparse transform as a non-trivial transform,
then we could also consider removing the Transform framework.
[CK Tile] Fix V6 pipeline applicability and split-image
initialization (#7936)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
After adding code generation via CK Tile Dispatcher, some fwd and bwd
weight tests for CK Tile convolutions are failing. This PR introduced
correct applicability checks and fixes the split-image parameter
initialization such that non-applicable instances are not invoked during
test execution and split-image instances are correctly initialized.
## Technical Details
Investigation revealed two distinct problems
1. For bwd weight, the compute V3 uses prefetch of 3 distinct tiles,
which works incorrectly when the number of K-slices addressed by the
workgroup is 1. This occurs when a large split-K value is used for a
problem that results in a small Gemm-K value.
2. For fwd direction, the current CK Profiler/test infrastructure
doesn't initialize the split-image parameters for instance where
split-image is enable. Uninitialized split-image values result in
non-deterministic behavior where the tests might randomly fail.
Fixed problem 1. by adding a check in `IsSupportedArgument` that marks
the instance invalid if the `num_loops = ceil(GemmK / (k_batch *
KPerBlock)) < 4` for V6 pipeline kernel instances. The check is
compile-time eliminated for other kernels.
Fixed problem 2. by adding initialization of split-image parameters when
split-image is enabled. The default initialization corresponds to full
image with no split, i.e., the number of splits is 1 and it has the size
of the full image.
Added unit tests for the added logic.
## Test Plan
Running the following test suites cover the logic added in this PR
- test_grouped_convnd_fwd_tile
- test_ck_tile_grouped_conv_fwd
- test_grouped_convnd_bwd_weight_tile
- test_ck_tile_grouped_conv_bwd_weight
All test suites above are included in the automated test runs.
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE][FMHA] Optimize long-context decoding on gfx11/12
(#7500)
## Motivation
Relevant issue: ROCM-22065
FMHA has less-than-optimal performance of long-context decoding (i.e.
when seqlen_q = 1) on gfx11/12.
This PR optimizes the splitkv pipeline and configs for such scenarios.
## Technical Details
Optimizations applied in this PR:
1. use tiles with smaller M0 (16 vs 64), these tiles are used when
seqlen_q <= 16
2. adapt qr_nwarp_sshuffle pipeline for gfx11, it allows to use more
warps even for M0 = 16 (the qr pipeline parallelizes work between warps
in M dim so with M0 = 16 it allows to use only 1 warp)
3. enable kMergeNumHeadGroupsSeqLenQ (an optimization that merges one
group of heads in GQA) for all hdim values, not only 128
4. increase the number of splits (multiply by the number of head groups)
if (3) is used
5. increase the number of splits for RDNAs (`multiProcessorCount` is the
number of WGPs on RDNAs, not CUs, so it should be doubled to have
meaning similar to CDNAs)
Performance on gfx1151:
| Case | develop (GB/s) | This PR (GB/s) |
|:-------|-------:|-------:|
| [fp16\|group\|bshd] b:1, h:32/32, s:1/45056, d:64/64 | 127.58 | 183.11
|
| [fp16\|group\|bhsd] b:1, h:32/32, s:1/45056, d:64/64 | 153.64 | 215.02
|
| [fp16\|group\|bshd] b:1, h:16/8, s:1/77184, d:128/128 | 120.51 |
225.76 |
| [fp16\|group\|bhsd] b:1, h:16/8, s:1/77184, d:128/128 | 130.62 |
223.84 |
| [fp16\|group\|bshd] b:1, h:32/32, s:1/9600, d:128/128 | 82.65 | 138.44
|
| [fp16\|group\|bhsd] b:1, h:32/32, s:1/9600, d:128/128 | 105.75 |
220.45 |
| [fp16\|group\|bshd] b:1, h:8/1, s:1/401024, d:256/256 | 16.27 | 187.89
|
| [fp16\|group\|bhsd] b:1, h:8/1, s:1/401024, d:256/256 | 16.28 | 188.19
|
## Test Plan
An additional test case is added to the exiting test. It uses seqlen_q =
1, GQA, no mask to trigger the changes
```
ninja test_ck_tile_fmha_fwd_fp16 && bin/test_ck_tile_fmha_fwd_fp16 --gtest_filter="*SplitKV*
ninja test_ck_tile_fmha_fwd_bf16 && bin/test_ck_tile_fmha_fwd_bf16 --gtest_filter="*SplitKV*
```
Manual testing can be done with these commands:
```
bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=64 -s=1 -s_k=$((352 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1
bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=16 -h_k=8 -d=128 -s=1 -s_k=$((603 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1
bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=128 -s=1 -s_k=$((75 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1
bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=8 -h_k=1 -d=256 -s=1 -s_k=$((3133 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1
```
## Test Result
All the tests must pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_Tile][MI450] Add bf16 output wmma instruction (16x16x32)
(#7830)
Wire __builtin_amdgcn_wmma_bf16_16x16x32_bf16 into CK Tile for gfx1250,
enabling bf16-input bf16-output WMMA at the warp GEMM level.
- Add WmmaTraits specialization for <gfx125_t, bf16, bf16, bf16,
16,16,32>
- Add WarpGemmAttributeWmmaImpl typedef and WarpGemmWmma alias
- Add Dispatcher entry for bf16->bf16 16x16x32
- Add warp_gemm test with reference GEMM validation
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_Tile] Add scale16 Support for F4 WMMA in CK_Tile
## Motivation
This PR adds CK Tile support for the scale16 F4 WMMA path on gfx1250 and
improves warp GEMM unit test coverage/structure for gfx1250-specific
cases.
## Technical Details
- Scale16 support in warp GEMM dispatch and WMMA trait plumbing: added
IsScale16 plumbing to warp GEMM dispatcher path
- Warp GEMM test restructuring for gfx1250: added Warp GEMM gfx1250
coverage to verify all F4 WMMA paths
## Test Plan
Run ./test_ck_tile_wg_32x16x128_fp4.
## Test Result
```
./test_ck_tile_wg_32x16x128_fp4
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (1751 ms total)
[ PASSED ] 3 tests.
```
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Fix Stream-K k_size calculation
## Motivation
In a recent benchmarking task for CK Tile Stream-K algorithm, we
identified that certain instances segfault. This change works to fix the
bug and adds necessary regression tests.
## Technical Details
The StreamK kernel constructs tensor views using a `k_size` parameter
that determines how much of the K dimension to process in each
iteration. Previously, this was calculated as:
```cpp
index_t k_size = num_loop_sk * TilePartitioner::KPerBlock;
```
This calculation assumes all macro tiles along K are exactly `KPerBlock` in size. However, when `K % KPerBlock != 0`, the final macro tile along K has a remainder size of `K % KPerBlock`, not a full `KPerBlock` (see the figure below):
<img width="961" height="488" alt="image" src="https://github.com/user-attachments/assets/3e1cceed-5dcd-4980-8b02-cee24eecf262" />
With the old code, a workgroup working with the `MPerBlock x (K % KPerBlock)` tile in A and B risk accessing illegal memory.
Hence, this change ensures that when `K % KPerBlock != 0`, workgroups processing iterations that include the final macro-tile along K calculate the correct `k_size` based on the remainder rather than assuming a full `KPerBlock`.
## Test Plan
I added the following tests:
1. Unit tests added for the Stream-K Tile Partitioner:
- `StreamKTilePartitionerBaseGetKSize/NoRemainderTiles` - validates full tiles
- `StreamKTilePartitionerBaseGetKSize/RemainderTiles` - validates remainder handling
2. Regression tests that test a case where `K % KPerBlock != 0`
## Test Result
Tests passed locally on gfx90a, gfx942, and gfx950.
## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
ck_tile: add FillUniformScaleDistribution and fix MX GEMM
scale init (#7724)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
### Problem
MX GEMM pipeline tests were passing vacuously: scale bytes were drawn
from a fixed range (40–60) which, for e8m0, maps to scales ≈ 10⁻²⁷ — far
below FP16 min denorm. Both GPU and CPU produced all-zero outputs, so
numerical checks passed without exercising the GEMM.
### Changes
**`include/ck_tile/host/fill.hpp`** — new
`FillUniformScaleDistribution<ScaleType>` functor
- Accepts human-readable float bounds and maps them to the raw byte
range of any ExMy scale type (e8m0, e4m3, e5m3) by re-centering the IEEE
754 exponent into the type's bias space
- Sampling is uniform over raw bytes → uniform over representable values
- Fixes left-shift UB: uses multiplication instead of `<< mant_bits` to
avoid shifting negative signed integers (C++17 UB)
- Adds `assert(min_r <= max_r)` to catch inverted-range UB when both
bounds exceed the type's representable range
- Provides default member values (0.125f, 2.0f) and `std::optional` seed
consistent with sibling fillers
- `/** */` Doxygen style with `@note` on snapping asymmetry
**`test/ck_tile/gemm_mx/test_mx_gemm_pipeline_util.hpp`** — fix scale
initialization
- Replace manual byte-range distribution with
`FillUniformScaleDistribution<>{0.125f, 2.0f}`
- Use distinct seeds for scale_a (11941) and scale_b (11943) to avoid
correlated scale tensors that were causing 60 test failures for
fp4+e5m3/e4m3 combinations
**`test/ck_tile/utility/test_fill.cpp`** — new unit tests for
`FillUniformScaleDistribution`
- 16 typed tests across e8m0, e4m3, e5m3: validity, range,
reproducibility, coverage, snapping, stress, nullopt seed, and range
overload
- Test helper `expected_raw_range` mirrors implementation clamping
exactly
[CK_TILE][GFX1250] Enable MX GEMM FLATMM with ASYNC
## Motivation
Enables MX GEMM FLATMM pipeline on gfx1250. The pipeline uses an async
load instruction for tensor A, which complements the existing MX GEMM
FLATMM pipeline with TDM load. At this time, only FLATMM MX pipelines
are enabled on gfx1250.
## Technical Details
The existing gfx950 implementation was extended to support gfx1250
architecture. All three MX FP data types are supported across the two
ASICs.
It should be noted that while the TDM pipeline uses an emulated
32x32x128 warp-tile instruction, the present submission relies on the
built-in 16x16x128 instruction, called 4 times per warp.
## Test Plan
Existing `test/ck_tile/flatmm` tests were extended to cover new gfx1250
functionality.
To help facilitate the testing in development,
`example/ck_tile/18_flatmm/script/smoke_test_mx.sh` script was
introduced to verify various combinations of supported data types and
pipeline versions.
## Test Result
The present submission is expected to work on both gfx950 and gfx1250
hardware for all reasonable sizes and all MX FP8/FP6/FP4 data types.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
- [x] Relies on #6978 and should only be merged after the changes are
merged to the `develop`.
[CK] suppress compiler warnings while building pytorch. (#7760)
## Motivation
Recently added compiler flags that are required to suppress false
warnings by latest staging compiler are not recognized by older compiler
versions and are triggering an avalanche of warnings. Previous attempt
to suppress them by using -Wno-unknown-warning-option flag didn't help,
because that flag wasn't recognized either and just added more warnings.
I've verified that current approach by checking the clang version
actually works as intended and makes the warnings go away.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Stream-K XCD remapping (#4279)
## Proposed changes
This PR adds support for XCD remapping as detailed in this
[document](https://amdcloud.sharepoint.com/:w:/r/sites/ComposableKernels/Shared%20Documents/Stream-K/Design%20Docs/XCD%20Mapping.docx?d=w2df1b0737dc54614970d99a2e26022d1&csf=1&web=1&e=mLVN4A).
On gfx942, workgroups are typically scheduled round-robin across XCDs,
which can lead to poor locality. We will use a remapping to assign
workgroups to contiguous tiles in the XCDs improving the locality and
the cache hit rate. This is done through a function that computes this
contiguous mapping from this
[PR](https://github.com/ROCm/composable_kernel/pull/3161), which we have
added to the StreamKTilePartitioner. This will require minimal changes
to the Stream-K algorithm, only requiring a remap at the time the
workgroups are partitioned. Through this approach we can improve the
data locality by improving cache hits therefore closing performance gaps
that are seen with the default scheduling. There have been unit tests
added to verify the function in isolation. This is an optimization that
is not specialized to just Stream-K GEMM and can be applied across GEMM.
Note: This only applies to the gfx942 as they introduce the XCDs.
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged
---
🔁 Imported from
[ROCm/composable_kernel#3652](https://github.com/ROCm/composable_kernel/pull/3652)
🧑💻 Originally authored by @arai713
---------
Co-authored-by: Astha <astha.rai713@gmail.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: arai713 <67439843+arai713@users.noreply.github.com>
[CK TILE] Unification Work – Add MFMA specialisations for `fp64_t` (#7104)
## Motivation
This PR adds two specialisations related to `fp64_t`.
## Technical Details
This adds two new specialisations for MFMA dense builtins, and adjusts
ABLayout and CLayout to L{K1BM} and L{M1BN}.
## Test Plan
All the new wrappers were added to the test suite in
test_amdgcn_mma_layout.inc.
## Test Result
Test should pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Add gemm universal preshuffle to MX GEMM (#5647)
## Motivation
Add gemm universal preshuffle support to existing MX GEMM pipeline.
The straightforward way to do this is to port the `mx_flatmm` pipeline
to the existing `gemm_mx` framework.
## Technical Details
The `mx_flatmm` pipeline was not deleted, to allow for
back-compatibility.
## Test Plan
Add `preshuffle` option to example: `tile_example_mx_gemm`.
Add new configurations with enabled preshuffle to the existing
`test/ck_tile/gemm_mx` tests.
## Test Result
Example and tests were successful on `gf950` architecture in the `Alola`
cluster.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
[CK Tile] Adding WMMA wrappers for sparse builtins (#6567)
## Motivation
This PR is part of the [WMMA/MFMA] unification work. It's the third of
the series of PRs (after
https://github.com/ROCm/rocm-libraries/pull/5801 and
https://github.com/ROCm/rocm-libraries/pull/6014) that add all the
necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse
WMMA intrinsics.
## Technical Details
This change adds new specializations for WMMA sparse builtins. In total,
we add 8 WMMA builtins.
## Test Plan
All the new wrappers were added to the test suite in
`test_amdgcn_mma_layout.inc`.
## Test Result
Test pass locally, waiting for the CI.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] upgrade CI to rocm7.13 as default compiler (#7612)
## Motivation
Upgrade the default docker and compiler version in CI to rocm7.13.
In order to pass all the checks I had to also clean up a lot of
non-ascii characters in the source code comments and modify a couple of
tests that were affected by a new compiler logic.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
[CK Tile] Fix Grouped Gemm quant mixed precision (#7537)
<Migrate from Internal repo PR>
test_ck_tile_grouped_gemm_quant_tensor would fail for mixed FP8/BF8
cases:
std::tuple<Row, Col, Row, FP8, F32, BF8, F32, F32, F16, TensorQuant,
False, True, False>,
std::tuple<Row, Col, Row, BF8, F32, FP8, F32, F32, F16, TensorQuant,
False, True, False>
GFX1250 would fail with incorrect results, GFX950 would fail when
compiling BF8+FP8 and give incorrect results for FP8+BF8.
The issue is due to the wrong ComputeDataType selection.
The fix is to consider original ADataType and BDataType even when
ComputeDataType is not void. For compiling error on gfx950, the bf8,
fp8, 16x16x32 warp Gemm is added.
[CK Tile] Adding MFMA wrappers for dense builtins (#6014)
## Motivation
This PR is part of the [WMMA/MFMA] unification work. It's the second of
the series of PRs (after #5801) that add all the necessary MMA builtins
as `amdgcn_mma` structs. This PR focuses on dense MFMA intrinsics.
## Technical Details
This change adds new specializations for WMMA dense builtins. In total,
we add 55 MFMA builtins.
## Test Plan
All the new wrappers were added to the test suite in
`test_amdgcn_mma_layout.inc`.
## Test Result
Test pass locally, waiting for the CI.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Support multi-vector reads in static encoding patterns (#7528)
## Motivation
The thread-raked / warp-raked / block-raked static tile distribution
patterns in `ck_tile` silently produce wrong results when the contiguous
tile dimension is larger than `warp_size * vector_size`, because the
encoding has no per-thread iteration dimension along X.
Concretely, with `M_Tile=N_Tile=128`, `VectorSize{A,B,C}=1` in
`ConvConfigComputeV3`, the grouped convolution backward-weight example
reports about 50 percent wrong values, with errors starting exactly at
the `X0*X1 = 64` boundary. The second pass over the contiguous dim is
never performed.
This PR extends the encoding so multi-vector reads in the contiguous
tile dimension are supported, while keeping every existing call site
bit-for-bit identical.
## Technical Details
Three files changed.
### 1. `include/ck_tile/core/algorithm/static_encoding_pattern.hpp`
Add a per-thread X iteration dimension in all three raked
specializations:
- `X0 = min(warp_size, XPerTile / X1)` — threads in X dim
- `X1 = min(LargestVec, VecSize)` — vector size per access
- `X2 = XPerTile / (X0 * X1)` — number of X-iters per thread (new)
`X2` is gated with `if constexpr (X2 == 1) { old } else { new }` in both
`make_2d_static_tile_distribution()` and
`make_shuffled_2d_static_tile_distribution()`.
The new encoding places `X2` in the middle of the Ys iteration list,
which preserves reverse symmetry between the regular `<..., X2, X1>` and
shuffled `<X1, X2, ...>` encodings.
Patterns updated: `thread_raked`, `warp_raked`, `block_raked`.
### 2. `include/ck_tile/core/tensor/transpose_tile.hpp`
Added a parallel `else if constexpr (... && NDimY == 3 && ...)` branch
alongside the existing `NDimY == 2` branch. The original branch is
byte-for-byte unchanged.
Both branches dispatch to the same `transpose_tile2d_impl_in_thread`,
whose body has always been NDimY-generic (iterates with `static_for<0,
NDimY, 1>` and `number<NDimY>{}`).
### 3.
`experimental/grouped_convolution_tile_instances/generate_instances.py`
Removed the two now-obsolete skip guards in `parse_bwd_weight_instances`
and `parse_bwd_data_instances`:
```python
if m_per_block > (warp_size * a_scalar_per_vector) or n_per_block > (warp_size * b_scalar_per_vector):
print(f"Skipping instance {instance_id} with multiple warps per continous tile dim since it's not supported yet.")
continue
```
Other unrelated skips (V5 / V6 / ASYNC_V4 pipeline gating,
irregular-load shapes, scalar-per-vector > tile size) are kept
untouched.
### Compatibility
Strict. Every existing caller has `X2 == 1` and therefore hits the
original encoding path verbatim. No upstream config or pipeline behavior
changes.
## Test Plan
The grouped convolution example is the natural exerciser since
`GroupedConvUniversalPipelineAgBgCrPolicy` selects `thread_raked` for
both A and B tiles, and all three conv directions share the same
`ConvConfigComputeV3`.
For each test below we ran:
```
./build/bin/tile_example_grouped_conv_bwd_weight [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_fwd [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_bwd_data [-prec={fp16,bf16}]
```
with `ConvConfigComputeV3` tile/vector parameters tweaked to cover both
code paths:
| Test | M / N / K | VecA/B/C | A path | B path | dtype |
|------|-------------|----------|------------|----------------|-------------|
| T1 | 16/64/32 | 4/8/4 | old (X2=1) | old (X2=1) | fp16 |
| T2 | 128/128/64 | 2/2/2 | old (X2=1) | old (X2=1) | fp16 |
| T3 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 |
| T5 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 (3 dir)|
| T4b | 128/128/128 | 1/1/1 | new (X2=2) | new (X2=2) | fp16 + bf16 (3
dir) |
A larger T4a (256/256/128) was attempted to stress both A and B with
X2>1 on bigger tiles but was blocked by the gfx942 hardware LDS cap (128
KB > 64 KB limit), independent of this PR.
For the generator change we ran:
```
python3 generate_instances.py --mode profiler --direction all
```
and verified `Skipping instance ... with multiple warps per continous
tile dim` no longer appears (count went from non-zero to 0); other skip
categories are unchanged.
`clang-format-18` was applied to both modified `.hpp` files (matches the
repo's `.clang-format`).
## Test Result
- T1 and T2 (compat-strict, every X2 is 1, old code path): `correct`.
Confirms existing callers are unaffected.
- T3 (X2=4 on B only): `correct`. First true exercise of the new NDimY=3
encoding + transpose branch.
- T5 (T3 across `fwd` + `bwd_data` + `bwd_weight`, fp16): all 3
`correct`.
- T4b (X2>1 on both A and B, fp16 + bf16, all 3 directions): all 6 runs
`correct`.
- Generator: 0 `multiple warps per continous tile dim` skips remaining;
other skips unchanged.
Sample run output (T4b, bf16, bwd_data):
```
shape: tile_gemm_shape_128x128x128x4_1x4x1_16x16x32
pipeline: pipeline_AgBgCrCompV3_128x128x128_256_1x1x1_1x4_1x1x1_..._DoubleSmemBuffer_0
Vector size A: 1, Vector size B: 1, Vector size C: 1
0.934907 ms, 8.34683 TFlops, 34.3178 GB/s
Relative error threshold: 0.00390625 Absolute error threshold: 0.25
The CPU verification result is: correct
```
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
[CK Tile][MFMA/WMMA unification] Add support for packed datatypes (tiny types) (#6088)
## Motivation
This MR makes all the changes required for the unified architecture to
be able to deal with packed datatypes i.e. int4, fp4, fp6, and bf6. The
crux is that layout parameters should be interpreted as describing the
pure mathematical matrix fragments, while the ext_vectors and tile
distribution encodings describe everything in terms of packed datatype
units. This matches how packed types are dealt with in ck_tile and
should play nicely with the load and store tile ops once we integrate
the unified framework into CK tile.
The bf6 datatype was added to CK tile in the form of pk_bf6x16_t and
pk_bf6x32_t, which did not exist before.
The ext_vector implementations of pk_fp6x16_t and pk_bf6x16_t (vec size
1 and 2) were extended to make the subscripting operator work as
expected.
The layout test was adapted to be compatible with all packed datatypes,
and all new intrinsics were added to the test.
This MR adds ALL intrinsics across ALL architectures which use packed
datatypes, as well as ALL scale intrinsics:
mfma_scale_f32_16x16x128_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6,
BF6xBF6)
mfma_scale_f32_32x32x64_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6,
BF6xBF6)
wmma_i32_16x16x16_iu4_w32
wmma_i32_16x16x16_iu4_w32_gfx12
wmma_i32_16x16x32_iu4_w32_gfx12
## Testing
All intrinsics were tested on all architectures.
[CK Tile] Eight Waves pipeline for MX GEMM (#5552)
## Motivation
Integrate Eight Waves pipeline in MX GEMM
## Technical Details
- EightWaves pipeline:
- Add pipeline, policy and block gemm (internally using existing
implementation used by GEMM and ABQuant)
- Extend support of EightWaves policy for FP4 (packed types)
- Async pipeline:
- Fix pipeline with packed scales (requires MRepeat and NRepeat to be
contiguous)
- block gemm specific for MX GEMM is defined because distribution
encodings have changed
- CShuffle:
- Add new functionality to support MRepeat and NRepeat contiguous
(defined by `TilesPacked`)
- Examples:
- Refactor examples to easily switch different configurations (similar
to GEMM universal)
- Scales values generated consistently with other microscale
implementations in CK Tile
- Add configuration for EightWaves pipeline
- Tests:
- Unify existing FP8 and FP4 tests
- Add tests for EightWaves pipeline
- Scales values generated consistently with other microscale
implementations in CK Tile
Note: FP6 support for MX GEMM was added later and the support for the
Eight Waves pipeline will be done in following PR
## Test Plan
Add new pipeline to tests: `test_ck_tile_mx_gemm_async` for both FP4 and
FP8
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK TILE] Unification Work – Add `print()` Utility to `MmaOpTraits` (#6207)
## Motivation
It would be useful to have a `print()` utility inside of unification
work's code scope, so that we can print all template params and derived
params of `amdgcn_mma` for easier debugging.
## Technical Details
Adding helper functions and struct to traits, adding `print_flags()` for
each `Default*CtrlFlags`, `amdgcn_target` and `MmaOpTraits` structs, and
adding `print()` for `amdgcn_mma`.
Note: the first commit is **not** in the scope of this PR. This PR
should be merged after https://github.com/ROCm/rocm-libraries/pull/5801
and https://github.com/ROCm/rocm-libraries/pull/5857.
## Test Plan
Adding test in layout test.
## Test Result
Test should pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] increase timeout limit for fmha_fwd tests to avoid CI failure on gfx11 (#7471)
## Motivation
This should prevent fmha_fwd tests from timing out on one of the slower
gfx11 CI nodes and generating false CI failures.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] add composable kernel support on gfx1250 (#6978)
## Motivation
Add composable kernel support on gfx1250.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Qun Lin <qlin@amd.com>
Co-authored-by: jialuo12_amdeng <jia.luo@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
[CK] Suppress new staging compiler errors (#7384)
## Motivation
This should make new builds with staging compiler pass.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[ck_tile][fmha_bwd] Fix sink_host OOB in group mode reference runner (#7272)
## Summary
In `fmha_bwd_runner.hpp`, the `sink_host` `HostTensor` is allocated with
first
dimension `shape_batch` (= 1 in group mode), but the reference forward
loop
accesses `sink_host(wb, i_h)` with `wb ∈ [0, batch-1]`. For any `wb >=
1` this
is an out-of-bounds heap read, silently corrupting the reference forward
math
chain (`lse_host`, `o_host`) and turning the bwd-side `d_sink_head_acc`
reference into non-deterministic garbage.
`HostTensor::operator()` does not bounds check, so the OOB is not caught
at
runtime. This manifests as intermittent `tile_example_fmha_bwd` failures
(25–67% fail rate) when `-sink_grad=1` is combined with `-mode=1` (group
mode),
with bit-exact but spurious `max_err` values like 4.27 / 14.6.
## Fix
One-line: allocate `sink_host` with `batch` (the real per-batch dim)
instead of
`shape_batch`, mirroring how `sink_host` is accessed by the loop.
```diff
- sink_grad ? std::array<ck_tile::index_t, 2>{shape_batch, nhead}
+ sink_grad ? std::array<ck_tile::index_t, 2>{batch, nhead}
Repro
tile_example_fmha_bwd -b=2 -h=2 -s=516 -s_k=253 -prec=bf16 -d=72 \
-bias=n -dbias=0 -p_drop=0 -iperm=1 -operm=1 -deterministic=0 \
-v=3 -mode=1 -kname=1 -sink_grad=1
Verification
- 0/30 fail on the repro config after fix
- Baselines (before fix):
- sink=1, mask=n: 25% fail rate (p ≈ 1.8e-4)
- sink=1, mask=t: 67% fail rate (p ≈ 6e-15)
Attribution
Shape bug introduced together with sink_grad in #5504. Unrelated to
#6914
(which is a fwd-only fix on a different code path)
```
## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Signed-off-by: junlin12 <junlin12@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
[CK Tile] Adding WMMA wrappers for dense builtins (#5801)
## Motivation
This PR is part of the [WMMA/MFMA] unification work. It's the first of
the series of PRs that add all the necessary MMA builtins as a
`amdgcn_mma` structs.
## Technical Details
This change adds new specializations for WMMA dense builtins. In total,
we have now 9 RDNA4 builtins and 3 RDNA3 builtins.
## Test Plan
All the new wrappers were added to the test suite in
`test_amdgcn_mma_layout.inc`.
## Test Result
Test pass locally, waiting for the CI.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Yung-sheng Tu <yung-sheng@streamhpc.com>
[CK_TILE] Preserve input strides in EightWaves async-load descriptor (#6611)
`MakeAsyncLoadADramWindow` in
`GemmPipelineAgBgCrCompAsyncEightWavesPolicy` was rebuilding the 6D view
descriptor with `make_naive_tensor_descriptor_packed`, which synthesizes
strides from lengths and assumes a dense layout. When the input view's
leading-dim stride is larger than its inner length (non-packed memory
layout), the resulting tile window stepped through memory at the wrong
stride.
Compose the unmerge transforms on top of the input view's existing
descriptor instead, so the actual runtime strides are preserved and the
correct `element_space_size` is inherited for bounds checking.
## Test Plan
Added an unit test showing the problem.
## Test Result
The new test fails before fixes and passes after.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK TILE] Unification of Scale MFMA/WMMA Policy Structs (#5857)
## Motivation
The existing unification work supports DENSE and SPARSE intrinsics. In
this PR, we enable support for SCALE intrinsics and add example SCALE
implementations.
## Technical Details
Adding MFMA SCALE intrinsics support, adding tests for MFMA SCALE
intrinsics, and adding WMMA SCALE policy trait.
Note: fp6 SCALE intrinsics support is not included in this PR, as its
handling in ck_tile is currently more specialized and does not follow
the same pattern as other datatypes.
## Test Plan
Added new tests for the relevant SCALE specialisations.
## Test Result
Test should pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Separate PermuteN epilogue from CShuffle epilogue into standalone file (#5863)
## Motivation
The PermuteN epilogue was previously embedded within
cshuffle_epilogue.hpp, despite having fundamentally different behaviour.
Coupling these two independent strategies in one file introduced
unnecessary complexity, SFINAE guards, and a dual operator() overload
selected at compile time via TiledMMAPermuteN_ template parameter.
This PR separates PermuteN into its own standalone
file(pertmuten_epilogue.hpp), simplifying both implementations and
making the codebase easier to maintain and extend independently.
## Technical Details
**New file: permuten_epilogue.hpp:**
contains PermuteNEpilogueProblem and PermuteNEpilogue, extracted from
the permuteN code path in cshuffle_epilogue.hpp.
**Cleanup of cshuffle_epilogue.hpp:**
- Removed the TiledMMAPermuteN_ template parameter from
[CShuffleEpilogueProblem]
- Removed the SFINAE-guarded permuteN operator() overload
- Removed the EnablePermuateN_ SFINAE alias
- CShuffle now only contains CShuffle logic; EightWave support
(independent feature) is retained
**Consumer migration :**
All consumer files now use compile-time epilogue selection via
[std::conditional_t]
`using GemmEpilogue = std::conditional_t<
TiledMMAPermuteN,
PermuteNEpilogue<PermuteNEpilogueProblem<...>>,
CShuffleEpilogue<CShuffleEpilogueProblem<...>>>;`
**Files modified:**
- flatmm_basic.cpp, moe_flatmm.cpp, a16w4_moe_flatmm.cpp,
mixed_prec_flatmm.cpp, mx_flatmm_instance.hpp — flatmm examples
- run_gemm_quant_example.inc — block-scale GEMM example
- gemm_weight_preshuffle_invoker.hpp — weight preshuffle invoker
- test_gemm_quant_fixtures.hpp, test_gemm_persistent_async_input.cpp,
test_gemm_pipeline_util.hpp — test utilities
- universal_gemm_invoker.hpp — universal GEMM invoker
- epilogue.hpp — add header updated to include permuten_epilogue.hpp
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
[CK_TILE] Restructure Tile Engine's benchmarking and profiling (#4769)
## Motivation
This PR introduces a restructure for the benchmarking and profiling
aspects of CK Tile's Tile Engine, expanding on the groundwork from this
previous https://github.com/ROCm/composable_kernel/pull/3434 and
outlined in this [design
document](https://amdcloud-my.sharepoint.com/:w:/r/personal/astharai_amd_com/Documents/Restructuring%20Tile%20Engine.docx?d=w14ea28a30718416988ed5ebb759bd3b2&csf=1&web=1&e=l3VBuX).
In PR 3434, to reduce repeated code we implemented:
- Base class that centralizes common functionality and provides a
default implementation (Universal GEMM)
- Child classes for GEMM variants override virtual functions to handle
variant-specific behavior
This refactoring in this PR follows the same process and should greatly
reduce the duplicated code present in Tile Engine and make it simpler to
add in new operations, increasing scalability.
## Technical Details
The files have been refactored around new base structs for benchmarks,
profiling and problem descriptions. The new base structs are:
- GemmProblem
- GemmBenchmark
- GemmProfiler
Universal GEMM, Preshuffle GEMM, and Multi-D GEMM all have child classes
that will inherit from these base structs overriding only what differs
per variant.
All common functions across the benchmarking and profiling files have
been moved into newly added common utility files under the commons/
directory. The new utility files are:
- utils.hpp: common functions for the benchmarking and profiling process
- benchmark_utils.py: common utility functions for the benchmark
generation
## Test Plan
I tested using the existing tests for Tile Engine.
## Test Result
All tests passed.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Skip fp16 dropout d256 batch tests for compiler VGPR aliasing bug (#6342)
## Summary
- Skip fp16 FMHA forward dropout tests that use the d256 tile in batch
mode, gated on compiler version
- The AMDGPU compiler miscompiles these kernels due to VGPR aliasing of
Philox RNG parameters under high register pressure (383 VGPRs)
- bf16 dropout tests are unaffected and cover the same code paths
## Root Cause
The compiler aliases `ph_seed` and `ph_head_offset` (Philox RNG state
stored in VGPRs) with other live data during the softmax main loop. This
causes corrupted `buffer_store_byte` writes for dropout randval on wave
lanes 32-63, producing NaN in output and LSE tensors.
**Conditions:** fp16 + d256 tile + dropout + batch mode + `qr` pipeline
+ gfx90a
## Changes
- `include/ck_tile/core/config.hpp`: Add
`CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE` macro
- `test/ck_tile/fmha/test_fmha_fwd.cpp`: Version-gated `GTEST_SKIP` in
`TEST_P(Dropout, ...)`
## Test plan
- [x] ROCm 7.1.1 (clang 20): 168/168 fp16 dropout tests PASS (no skip
active)
- [x] ROCm 7.12 (clang 22): 132 PASS, 36 SKIPPED, 0 FAILED
- [x] bf16 dropout tests: 168/168 PASS (unaffected by this change)
[CK Tile] Unification work - mma transformations pipeline (#5508)
## Motivation
In this PR we showcase how the amdgcn structs could be used in a pipeline that does some extra pre/post processing.
For the sparse intrinsics, so far we compressed the A vector "on the fly" right before the execution of the builtin. This might introduce performance issues down the line if, for example, the user decided to chain multiple sparse builtins. We tackle this problem by creating a specific SparseCompressTransform.
A MmaPipelineBase is also created to facilitate those kind of higher level compositions of the amdgcn structs and is integrated to the existing WaveWiseMma prototype. There is an effort to facilitate future operations, like swizzle A/B, C transpose or double/quad attr num access through the MmaPipelineOptionFlags, but those are not yet defined and should do so in a future PR.
The pipeline base class is basically at the RFC stage.
We also create a runtime test for the existing WaveWiseMma, as well as one for the SparseMma pipeline.
## Technical Details
The goal should be to have the pipeline easily expandable. May the CRTP of the base class or the interface in general be insufficient or unable to handle all of our needs, then a design modification should be discussed.
## Test Plan
New tests are added.
## Test Result
Tests should pass.
---------
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
[MIOPEN] [CK] Revert "[CK] Disable test cases affected by compiler codegen bugs on gfx90a" (#6400)
Reverts ROCm/rocm-libraries#6343
This is causing failures in miopen, namely Dbsync gfx942 even though it shouldn't be affected so this needs to be investigated. Please add miopen as a label to the new PR for addressing the compiler codegen bug so that this can be addressed simultaneously.
[CK] Disable compilation of problematic bwd weight conv instances for gfx90a (#6343)
## Motivation
Due to compiler version update, there are test failures in the test
suite `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current `develop` branch, the test
failures are blocking any PR merges into `develop`. An example of a
failed CI runs is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as tests are passing for clang compiler version 20.0 and failing
for clang compiler version 22.0.
## Technical Details
This PR disables the compilation of the problematic bwd weight conv
instance for `gfx90a` by adding a new CMake flag `CK_USE_GFX90A` that
allows us to detect when we are compiling for `gfx90a`. Using the new
CMake flag, compilation of instance
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>` is disabled for `gfx90a`.
Co-authored-by: Ville Pietilä <>
[CK Tile] Add Tile Distribution Encoding Calculator (#5515)
## Motivation
We want to be able to calculate TileDistributionEncodings describing
register mappings for any MmaOp. This is necessary for further
integration with CK Tile.
This MR adds a new struct TileDistrEncCalc, which takes an amdgcn_mma
type (MmaOp) and provides ABC warp distribution encodings for mapping
matrix fragment coordinates to register coordinates
(lane, vector item) and vice versa. It is able to take CTranpose,
Swizzle, and NumAccessA / NumAccessB template parameters for tweaking
the tile distributions. Swizzle modification will be implemented later.
The current implementation can deal with all intrinsic types and
block-hiding.
This MR also adds some additional static asserts and derived params
within amdgcn_mma_base, to enforce consistency and help calculate Tile
Distributions for block-hiding intrinsics.
An Example was added that uses the Tile Distr Enc Calc to calc and print
register layouts for Tile Distributions for some of our amdgcn_mma
structs. It also makes sure that the CTranspose modifier works as
intended.
Some additional gfx9 intrinsics were added to test block-hiding layouts
for the different types of C-block-hiding layouts.
The sparse intrinsic wrappers were updated according to Chris's recent
changes in another branch
(https://github.com/ROCm/rocm-libraries/pull/5508), which moved the
compression step outside of the intrinsic itself. This is necessary to
make sure that the Calculator can deal with this new interpretation of
the sparse intrinsics. I directly copied the new amdgcn structs from
Chris's branch and changed nothing else to avoid more complex merges in
the future. Note that this means I did not update a bunch of related
sparse code since that would be a lot, and therefore I disabled
test_amdgcn_sparse_mma for now.
The amdgcn_mma_layout test was refactored a bit:
- The old register mapping utility was removed and its use was replaced
by the new TileDistrEncCalc
- More tests were added to test layouts for different types of
block-hiding and sparse intrinsics
- The Selector method was removed and the tests were split up over
target architectures, with each target arch having a direct list of
amdgcn structs to be tested. This ensures that we force specific tests
on specific architectures and makes sure that the selector doesn't
quietly do some workarounds like creating compound intrinsics.
## Test Results
Layout tests based on calculated tile distribution encodings pass on all
architectures. Calculator works for all currently added amdgcn structs,
which includes different types of block-hiding and sparse intrinsics.
Printed layouts from new example verified by eye. CTranspose modifier
tested for large set of intrinsics.
CK: Extract shared boilerplate from 47 gemm_quant test files (#6323)
Depends on #6303
## Summary
Extract shared test boilerplate (includes, type aliases, test fixture
macros) from 47 `test_gemm_quant_*` files into a single
`test_gemm_quant_common.hpp` header. Each test file is reduced from ~50
lines of boilerplate to ~5 lines.
| Metric | Value |
|--------|-------|
| Files changed | 48 |
| Insertions | +413 |
| Deletions | −1,106 |
| **Net lines removed** | **−693** |
### What changed
| Before | After |
|--------|-------|
| 47 test files, each with ~50 lines of identical includes, type
aliases, and fixture macros | 1 shared header
(`test_gemm_quant_common.hpp`) + 47 thin files (~5 lines each: include +
params) |
### Readability assessment
A code realist review confirmed this change **improves readability**:
the 47 test files had identical boilerplate obscuring the only
meaningful content — the `GemmConfig` type alias and test dimensions.
After the refactoring, each file's unique configuration is immediately
visible, and adding a new test variant requires specifying only the
varying parameters instead of copying 50 lines.
### Cumulative cleanup series stats
| PR | Description | Net lines |
|----|-------------|-----------|
| #6300 | Remove 61 dead `#if 0` blocks | −2,648 |
| #6302 | Remove 41 commented-out dead code blocks | −2,861 |
| #6303 | Remove 4 orphaned files | −3,886 |
| This PR | Extract gemm_quant test boilerplate | −693 |
| **Total** | | **−10,088** |
CK: Remove 4 orphaned files with verified replacements (~1,025 lines) (#6303)
Depends on #6302
## Summary
Remove 4 orphaned files that have verified replacements already in the
build.
| File | Reason | Replacement |
|------|--------|-------------|
| `test_gemm_pipeline_compiler.cpp` | Refactored into 13 smaller tests |
`_compv3`, `_compv4`, `_mem`, `_persistent`, etc. |
| `test_grouped_gemm_quant.cpp` | Refactored into 5 smaller tests |
`_rowcol`, `_tensor`, `_aquant`, `_bquant`, etc. |
| `..._f8_f8_f16_..._comp_default_instance.cpp` | Superseded by split
files | `_part1.cpp` + `_part2.cpp` |
| `..._f8_f8_f16_..._comp_kpadding_instance.cpp` | Superseded by split
files | `_part1.cpp` + `_part2.cpp` |
Each deletion was verified:
- Original file is NOT in any CMakeLists.txt
- Replacement files ARE in CMakeLists.txt and actively compiled
- Content is fully covered by the replacement files