Commit Graph

1076 Commits

Author SHA1 Message Date
Qianfeng Zhang
fafb375122 Use packed cast_tile for fp16 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
6686c7af44 Update to partially reduce the register spilling 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
459c5565d4 Add IsFirstVLdsBufferOverlapLastKLdsBuffer() check to reduce call of s_barrier() 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
8a6c2591b0 Update the in pipeline codes 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
d360c61200 Fix in calculation of total_flops and update benchmark scripts 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
251136cca7 Add output of estimated TFLOPS 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
644ea27e0e Update to the scripts and error thresholds 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
2a71304bbb Tune the input initialization to avoid over-flow in silu 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
9c2dbf8d64 Add benchmark_hstu_attention.sh 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
cdb0704377 Add several verification test cases 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
beb6fa8cc1 Fix in kernel and forward dispatch for jagged mode 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
24822a4898 Fix in hstu-attention pipeline (which makes some testing cases passed) 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
50b0af257c Fixes and updates 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
72774b718b Change in HstBlockMasking and kernel/reference codes for using masking 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
74a0ec4609 Fix and change in example 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
450494945f Add hstu attention kernel implementation, instances and interfaces (building succeeded) 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
e6b6323b67 fix the jagged mode tensor access in reference_hstu_attention 2026-06-23 09:17:26 +00:00
Qianfeng Zhang
a19f73c305 Initial reference implementation of hstu attention 2026-06-23 09:17:23 +00:00
Enrico Degregori
2733e75900 [rocm-libraries] ROCm/rocm-libraries#6565 (commit d41715e)
[CK Tile] Async support pipeline V3

## Motivation

Optimize pipeline V3 for gfx950 by enabling buffer load to lds (async
pipeline)

## Technical Details

- Add `Async` bool to `Problem` struct to enable async pipeline in
existing one
- Add `static_move_ys` to load transpose. This generates offset in
assembly instructions saving registers
- Add `is_valid` to `async_get_vectorized_elements`. Before hard coded
to true. It allows to support padding
- Remove unnecessary restrictions to `is_a_load_tr` and `is_b_load_tr`
(wider use of lds load transpose on gfx950)
- Integrate async support in existing V3 pipeline (avoid pipelines
duplication)
- Create policy to support both async and default cases. This could be
used by any async pipeline (next steps)
- Define `wg_attr_num_access` separately for A and B. This allows to
optimize ds_read instruction width for cases when one matrix is
transposed and the other is not. Before in such cases, `ds_read_b64` was
used instead of `ds_read_b128`
- Add test for V3 async. Currently only supporting cases with A and B
having the same type

## Test Plan

New test `test_ck_tile_gemm_pipeline_compv3_async`

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-19 06:57:14 +00:00
Ville Pietilä
60b276647b [rocm-libraries] ROCm/rocm-libraries#8157 (commit b0d9d39)
[CK Tile] Rule-based configuration generation in CK
 Dispatcher codegen (#8157)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

The CK Tile Dispatcher code generation for CK Tile Profiler relies on
flat JSON files to list the generated configurations. This approach has
the following problems

- The JSON files are verbose
- The JSON files get easily out of sync with the CK Builder .config
files from which they were generated from.
- The JSON file based configuration make it hard to list explicitly the
rules that govern the instance generation.

## Technical Details

Replaced the JSON files with a rule based configuration. To preserve the
existing functionality, the `profiler` and the `tests` instance sets are
generated directly from the CK Builder config files. The JSON config
files are removed from source control, and the "on-the-fly" generation
guarantees that the Dispatcher codegen uses up to date configurations.

This is PR introduces six different rule sets for the CK Tile Dispatcher
code generation

1. `profiler`: matches with the old JSON set of profiler configurations.
2. `tests`: matches with the old JSON set of tests configurations.
3. `full`: full configuration set created from a rule-based config
selection
4. `full-tests`: a subset of `full` for generating configurations for
convolution integration tests.
5. `tiny`: a subset of `full-tests` to produce the minimal set of
configurations to test the Dispatcher codegen.
6. `default`: the default rules, which corresponds to the existing
heuristic rules for configuration selection. This ensures that ML based
kernel selection doesn't get broken.

The main use of the `full` rule set is to define a reasonable solution
space for the possible implicit GEMM configurations. We start from the
configurations that allowed by the device architecture. The `full` rule
set defines the relevant tile sizes for each convolution direction. From
the tile size we have a curated mapping to the number of waves over the
different GEMM axes, i.e., we describe how many waves each GEMM
dimensions corresponds to. The GEMM-K wave tile dimension can be
computed from the other parameters and does not need to be listed
explicitly.

An orthogonal axis to the tiling strategy is the vectorization strategy.
This mainly defined by the data type and hardware as in general, we want
to use the maximum possible load widths. The maximum sizes for each
convolution direction variant are defined by the implicit GEMM matrix
dimensions. For cases where have a low number of channels per
convolution group, we need smaller vector load sizes. These are captured
by the `VecStrategy` enumeration in the codegen rules.

The problem with the rule based configuration selection is that we "over
generate" configurations. The old JSON configurations compose
approximately 25% of all configuration that the `full` rule set creates.
The additional configurations are valid, but they many not provide any
performance benefits. Hence, we keep the `profiler` and `tests` rule set
for now to avoid building an excessive amount configurations by default.
The `full` rule set can be taken into use by specifying CMake
configuration flag `-D DISPATCHER_RULE_SET=full`. By default, the
`tests` rule set is used, i.e., we don't change the existing bahaviour.

## Test Plan

Added a new stage in the CI/CD pipeline that ensures the Dispatcher
codegen rules are up to date. Otherwise the functionality is covered by
the existing CI/CD tests. There are no functional changes to the
convolution kernels. Only how the different instances are generated.

## Test Result

If the CK Tile conv instances build without errors, the Dispatcher
codegen is generating valid code. If all tests in CI/CD pipeline are
passing, the Dispatcher codegen generates valid instances.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 01:22:50 +00:00
damien-lejeune
5bebfd460f [rocm-libraries] ROCm/rocm-libraries#8492 (commit 46b6a06)
Add tile size for FMHA batch prefill bf16 for MI308X

## Motivation

Adding a tile size adapted to MI308X, for the FMHA Batch Prefill BF16
input type case

## Technical Details

N/A

## Test Plan

Benchmarking from the Aiter side with:

```
python3 op_tests/test_batch_prefill.py  -s 8000 -p 1 -q 4 -k 1 --head_dim 256 -c true -d bf16 --input_dtype bf16 --quant_method none --kv_layout linear -t sglang -l 0.0 --return_lse false --profile
```

## Test Result

We see an improvement with the new tile size on MI308X (both with PLT
mode OFF and ON)

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
2026-06-17 06:22:26 +00:00
Sami Remes
c1f7104852 [rocm-libraries] ROCm/rocm-libraries#6663 (commit f19fc01)
[CKTile] Fix MX GEMM: num_loop==3 dispatch, split-K,
 unsupported-shape guard (#6663)

Three independent MX GEMM correctness bugs reported against
example/ck_tile/42_mx_gemm (fp8xfp8, A=Row/B=Col) on MI350X, plus one
host-side atomic-add accumulation bug in the example's repeat loop.

- Pipeline (gemm_pipeline_ag_bg_cr_comp_async.hpp): BlockHasHotloop
required num_loop > PrefetchStages, which let num_loop == 3 enter a hot
loop that produced 5 gemm accumulations instead of 3 (K == 3*K_Tile,
e.g. K=768, deterministically wrong). Require num_loop >= 4 instead:
pre-pipeline + TailNumber::Three already totals exactly 3.

- Kernel (gemm_mx_kernel.hpp): split-K was silently broken because
GridSize did not thread k_batch into blockIdx.z and the scale tile
windows were anchored at K=0 for every k_id. Every k_id >= 1 therefore
read the wrong packed scales. Fix:
* GridSize returns dim3(grid_x, 1, k_batch) (persistent and
non-persistent).
* MakeScaleA/BBlockWindows accept a k_elem_offset and translate it to a
packed-scale K offset (also apply pad_tensor_view so OOB scale loads
return zero, matching A/B padding).
* operator() derives k_id from blockIdx.z, uses GetSplitKElemOffset
(matches Underlying::SplitKBatchOffset's K1-aligned formula), and
dispatches the epilogue with memory_operation_enum::atomic_add for
k_batch > 1, set for k_batch == 1. Same fp16/bf16 even-vector-size guard
as UniversalGemmKernel.
* MakeCBlockWindows templated on DstInMemOp; unconditionally applies
pad_tensor_view using kPadM/kPadN so partial trailing M/N tiles are
handled correctly.

- Compile- and runtime unsupported-shape guards (gemm_mx_kernel.hpp):
add IsSupportedArgument and a static_assert for configurations that
produce silent wrong results:
* static_assert(!kPadK) -- the MX comp-async pipeline uses
async_load_tile whose OOB check is per-vector-start, so a vector
straddling the K pad boundary reads garbage. Until the async path learns
per-element pad masking, reject kPadK at compile time.
* Runtime: k_batch >= 1; M/N multiples of MPerBlock/NPerBlock when
kPadM/kPadN are false; M >= MPerBlock and N >= NPerBlock always
(CShuffleEpilogue cannot safely run with a single partial tile); K %
(KPerBlock * k_batch) == 0; and for k_batch > 1, K must be a multiple of
WarpTile_K * k_batch so every split lands on a packed-scale boundary.
  * All error paths log under CK_TILE_LOGGING with actionable messages.

- Example (example/ck_tile/42_mx_gemm/mx_gemm_instance.hpp):
* Call Kernel::IsSupportedArgument up front and throw a clear
runtime_error for rejected shapes (was silently launching an unsupported
kernel).
* Switch to launch_kernel_time_mask with a clear_gemm_output preprocess
that zeroes C between iterations when k_batch > 1 (mirrors
universal_gemm_invoker). Without this the default -warmup=50 -repeat=100
accumulated 150 atomic_adds into C after the kernel-side split-K fix.

Tests (test/ck_tile/gemm_mx/):
- Add MXfp8_GemmConfig16_PadMN (kPadM = kPadN = true).
- test_mx_gemm_fp8.cpp: HotLoopTailNumLoopThree (K=768 regression),
SplitK (k_batch=2,4 across full_k/partial_k paths),
TestMxGemmFp8PadMN::{MNPaddingAligned, MPadding, NPadding, MNPadding}
covering trailing partial tiles along M, N, or both.
- Run(...) now takes k_batch.
- packScalesMNxK: guard against OOB (mn, k) reads from src and
initialise e8m0 bytes to the zero exponent (0x00) instead of the
default-constructed NaN (0xFF), so padded lanes don't poison the packed
int32_t shared with in-range lanes.
- test_mx_gemm_instance.hpp: call IsSupportedArgument before launch.

Verification on gfx950, ROCm 7.2.0:
- ctest -R test_ck_tile_mx_gemm -> 100% (2/2).
- Example sweep over the original bug-report shapes: all K-aligned
shapes now validate correct (including 4096^3 sk=2 and the K=768 cases);
all K=128 shapes cleanly rejected with the new error message instead of
producing silent wrong results.

Made-with: Cursor

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-15 08:28:55 +00:00
damien-lejeune
aab1d219f5 [rocm-libraries] ROCm/rocm-libraries#8350 (commit f92ded1)
Add tile shape for FMHA batch prefill on MI308X (on fp8,
 hdim=256) (#8350)

## Motivation

Add a tile size appropriate for FMHA batch prefill fp8/hdim256 on MI308X

## Technical Details

Appending the tile shape to the existing factory such that it can be
picked up by Aiter

## Test Plan

Ran the performance test on both MI300X and MI308X

## Test Result

MI300X performance seems unaffected by this change. MI308X does improve.

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
2026-06-15 07:00:35 +00:00
ltqin
0954a8f3fa [rocm-libraries] ROCm/rocm-libraries#8262 (commit d4ff8fc)
[CK_TILE] Add graph capture support for FMHA backward(new
 branch) (#8262)

## Motivation
Add HIP graph capture support for FMHA backward operations. The original
implementation only supported normal execution mode and would cause
use-after-free crashes when used with graph capture replay.
When FMHA backward is captured into a HIP graph:
- First replay: host callback executes and deletes the closure (as
designed for normal mode)
- Subsequent replays: use-after-free crash because the closure was
already freed
This PR enables `fmha_bwd_launcher::prepare_workspace_async()` to work
correctly in both normal execution and graph capture modes.
2026-06-14 03:11:53 +00:00
Chao
320a813d67 [rocm-libraries] ROCm/rocm-libraries#6533 (commit 5dcaa45)
[CK_TILE] Add host-side Pack-GQA optimization for FMHA
 forward (#6533)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

[CK_TILE] Add host-side Pack-GQA optimization for FMHA forward

## Motivation

Host-side Pack-GQA optimization for CK-Tile FMHA forward. Reshapes Q
tensor
from `[b, nhead_q, seqlen_q, d]` to `[b, nhead_kv, nhead_ratio *
seqlen_q, d]`
by adjusting strides, so grouped Q-heads sharing the same KV data are
processed
in a single tile. Zero kernel changes — runner-only.

Phase 1: non-causal attention with GQA ratio packing.
Phase 2: extends to dropout and split-kv paths, fixes stride edge cases.

## Technical Details

Modified files (2):
- `example/ck_tile/01_fmha/example_fmha_fwd.cpp` — Pack-GQA flag
plumbing
- `example/ck_tile/01_fmha/fmha_fwd_runner.hpp` — Q tensor reshape
logic,
  stride adjustment for GQA ratio packing

New files (1):
- `example/ck_tile/01_fmha/test_pack_gqa_phase2.sh` — 53 test cases
covering
  non-causal, dropout, split-kv, various GQA ratios

## Dependencies

None — this PR is standalone.

## Test Plan

- GPU validation on MI300X (gfx942, ROCm 6.4.1):
- Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=32 -h_k=8 -s=2048
-d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3`
- GPU validation on MI350X (gfx950, ROCm 7.0), 53 parameterized test
cases:
- Command (GQA 4:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=32
-h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3`
- Command (GQA 8:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=64
-h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3`
- Command (decode): `./build/bin/tile_example_fmha_fwd -b=64 -h=32
-h_k=8 -s=1 -s_k=4096 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1
-repeat=3`

## Test Result

Benchmark results (MI350X, gfx950, ROCm 7.0):

| Config | Without Pack | With Pack | Improvement |
|--------|-------------|-----------|-------------|
| GQA 4:1 prefill b=2 h=32 hk=8 s=2048 d=128 bf16 | 690.05 TFlops (0.199
ms) | 695.61 TFlops (0.198 ms) | +0.8% |
| GQA 8:1 prefill b=2 h=64 hk=8 s=2048 d=128 bf16 | 706.25 TFlops (0.389
ms) | 729.35 TFlops (0.377 ms) | +3.3% |
| GQA 8:1 decode b=64 h=32 hk=4 s_k=4096 d=128 bf16 | 305.20 GB/s (1.763
ms) | 1813.41 GB/s (0.297 ms) | **+5.9x** |
| LLaMA-70B decode b=32 h=64 hk=8 s_k=4096 d=128 bf16 | 591.70 GB/s
(0.909 ms) | 1820.65 GB/s (0.295 ms) | **+3.1x** |
| MHA ratio=1 b=2 h=8 s=4096 d=128 bf16 | 695.16 TFlops | 702.72 TFlops
| no regression |

Benchmark results (MI300X, gfx942, ROCm 6.4.1):

No regression on MI300X. Pack-GQA is a runner-only optimization (zero
kernel changes), performance impact is within noise on MI300X.

| Config | TFlops / GB/s | Time (ms) | Delta vs baseline |
|--------|-------------|-----------|-------------------|
| MHA bf16 b=2 h=8 s=4096 d=128 | 336.52 TFlops | 0.408 | -1.7% |
| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 | 322.52 TFlops | 0.426 |
-0.7% |
| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 | 349.85 TFlops | 0.786 |
+0.5% |
| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 | 381.29 TFlops |
1.442 | +1.2% |
| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 | 697.32 GB/s | 1.541 |
+0.8% |

All validation tests pass (`valid:y`) on both MI300X and MI350X.

Additional validation:
- 53 parameterized test cases pass (23 phase 1 + 30 phase 2)
- GQA ratios tested: 1:1, 2:1, 4:1, 8:1, 32:1
- No regression on MHA (ratio=1) workloads
- fp16 and bf16 validated
2026-06-10 01:56:44 +00:00
Emily Martins
674f7cdc0e [rocm-libraries] ROCm/rocm-libraries#8141 (commit d3defa6)
[CK] Remove Stream-K from old CK

## Motivation

Since Stream-K has a CK Tile implementation, we no longer need Stream-K
in old CK. Hence, this PR removes Stream-K from old CK.

## Technical Details

All Stream-K artifacts in old CK have been removed including examples,
tests, kernels, and CK profiler artifacts.

## Test Plan

Ran a CI run on the branch before publishing PR.

## Test Result

All tests passed.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
2026-06-08 16:47:26 +00:00
Bartłomiej Kocot
2c363870d9 [rocm-libraries] ROCm/rocm-libraries#6744 (commit 9d056e8)
[Ck][CK Tile] Global Load/Store for Large Tensors support
 (#6744)

## Motivation

Create solution to support large tensors in the entire ck tile.

## Technical Details

- add possiblity to use global load
- int64 indexing

## Test Plan

conv fwd tests

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-913
2026-06-06 10:14:17 +00:00
Yung-sheng Tu
e826b2eb7e [rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f)
=?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?=
 =?UTF-8?q?Add=20MFMA=20specialisations=20for=20`tf32=5Ft`=20(#6768)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

This PR adds two specialisations related to `tf32_t`.

## Technical Details

This change treats `tf32_t` as a concrete type rather than an empty
`struct`. It also adds two new specialisations for MFMA dense builtins
and resolves existing circular include issues.

## Test Plan

All the new wrappers were added to the test suite in
test_amdgcn_mma_layout.inc.

## Test Result

Test should pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-05 12:27:41 +00:00
Sami Remes
ad4e2e7624 [rocm-libraries] ROCm/rocm-libraries#7199 (commit 23f7320)
[CK_TILE] [QuantGEMM] Fix SplitK tail handling and other
 improvements (#7199)

This pull request introduces improved and more robust split-K support
for quantized GEMM. The main changes add runtime validation, utility
functions for split-K batch calculations, pointer offset handling for
split-K in grouped kernels, and enhanced support for various tensor
layouts. The changes also improve error handling and provide more
flexibility for runtime tail handling in split-K pipelines.

**Split-K Support and Validation Enhancements:**

* Added runtime validation to ensure `k_batch` is a positive integer and
that split-K configurations do not produce empty final batches or
mismatched pipeline tails, with detailed error messages and logging for
misconfiguration.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1184-R1211)
[[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1161-R1250)
* Introduced utility functions `get_splitk_batch_k_read` and
`get_splitk_last_batch_k` to compute per-batch K read sizes and handle
split rounding, ensuring correct and consistent split-K batch
partitioning.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R206-R234)
[[2]](diffhunk://#diff-635b89bdffa96b2b42f1632520cde36701d7d631e864185591f6b32f7645cf47L104-R107)
[[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L388-R417)
[[4]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1161-R1250)
* Changed the default value of `k_batch` in `QuantGemmHostArgs` to 1 (no
split-K) for safer default behavior.

**Pointer Offsets and Grouped Kernel Handling:**

* Updated `QuantGroupedGemmKernel` to apply split-K per-batch offsets to
all input pointers, mirroring the behavior of non-grouped kernels and
ensuring correctness for split-K launches.
* Modified AQ tensor view handling to correctly reflect the remaining
K-groups from the split-K batch's offset position, improving accuracy
for split-K in grouped kernels.

**Pipeline and Layout Flexibility:**

* Added support for runtime selection of split-K tail handling via a new
template parameter `RuntimeSplitKTail_`, with new helper methods to
dispatch GEMM pipelines accordingly.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R273)
[[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1496-R1567)
[[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1427)
[[4]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1447-R1629)
[[5]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L1459-R1641)
* Improved handling for tensor layout cases, including preshuffled B and
both row-major and column-major AQ layouts, ensuring correct pointer
arithmetic and compatibility checks.
[[1]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R438-R454)
[[2]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871L464-R516)
[[3]](diffhunk://#diff-d000149a681cd42bfb9947872c603e556cea26cbd7fd4f8f60afc6595d975871R1184-R1211)
2026-06-05 11:41:49 +00:00
Enrico Degregori
7b9245f18c [rocm-libraries] ROCm/rocm-libraries#5854 (commit 8e2d46d)
[CK Tile] Async support preshuffle GEMM

## Motivation

Add async support to existing preshuffle GEMM pipeline

## Technical Details

Notes:
the implementation avoids previous strategy of duplicating pipelines for
async support and instead add a switch `Async` to the ops Problem to
enable async pipeline. Then, integrate the async pipeline in the
existing one. This allows to avoid code duplication and facilitate the
integration of buffer load to lds in existing pipelines. In my opinion,
it should be used also for other pipelines which don't support buffer
load to lds yet and it would also be a good idea to refactor the
existing async GEMM pipelines with the same approach.

Summary:

 - integrate buffer load to lds in existing pipeline
- add optimal tensor descriptors for vmem loading and lds reading. They
are currently optimized for 16x16 wave tiles but they also work for
32x32 wave tiles. Optimizations for 32x32 wave tile requires different
lds layout and it will be done in a follow-up issue
 - Add async config to examples
 - Add test (gfx950 only)

## Test Plan

New test for gfx950 `test_ck_tile_gemm_pipeline_wp_async`

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-05 07:17:09 +00:00
Enrico Degregori
bdd7a8333d [rocm-libraries] ROCm/rocm-libraries#6672 (commit bda3f97)
[CK Tile] PermuteN support MX GEMM

## Motivation

Add PermuteN support to preshuffle MX GEMM

## Technical Details

 - Modify `shuffle_b_permuteN` to support MX preshuffled layout
- Add `preShuffleScalePermuteN` with same functionality of
`preShuffleScale` but layout consistent with PermuteN
 - Include MX pre-processing functions in the library

## Test Plan

Add test configuration for permuteN with preshuffle (both FP4 and FP8)

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Cong Ma <congma13@amd.com>
2026-06-05 03:04:43 +00:00
John Afaganis
96c39b331e [rocm-libraries] ROCm/rocm-libraries#7829 (commit 13af7da)
[ck] Enforce ASCII-only C/C++ sources for hipRTC
 compatibility (#7829)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

CK source files must be compilable via **hipRTC (HIP runtime
compilation)**, whose preprocessor does not accept non-ASCII bytes
anywhere in a translation unit — **including in comments**. Bytes that
are harmless under `hipcc` (em-dashes, smart quotes, multiplication
signs, Greek letters, box-drawing glyphs, etc.) cause hipRTC to fail at
preprocessing time. These regularly leak in via LLM-assisted authoring
or copy/paste from formatted documents and silently break hipRTC paths
that are not exercised by the default `hipcc`-based build matrix.

This PR (a) cleans every existing violation (53 files) and (b) adds a
pre-checkin gate so new violations are rejected before merge.

## File extensions covered

Both the cleanup scan and the new Jenkins enforcement stage use the same
predicate:

```
*.h  *.hpp  *.cpp  *.h.in  *.hpp.in  *.cpp.in  *.inc  *.cl
```

(excluding `*/build/*` and `*/include/rapidjson/*`). This is a strict
superset of the existing `Clang Format` stage's predicate — `*.inc` is
added so test-fixture include files are also gated. The local pre-commit
hook's `c++/inc` type filter covers the same set.

## Why no enforcement today

CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so
the existing `pre-commit` workflow doesn't touch CK. The local CK
`.pre-commit-config.yaml` only runs for developers who installed hooks.
The **authoritative gate is therefore the new Jenkins stage** in this
PR; the local hook is convenience.

## Commit layout (bisect-friendly)

1. `79798aa6261` — **`[ck] Convert reflect/ rendering to ASCII for
hipRTC compatibility`**
Behavior change, isolated. `TreeFormatter` swaps `├─ / └─ / │ ` for `|-
/ +- / | ` (3-col width preserved so alignment is unchanged).
`conv_description.hpp` swaps `×` for `x` as the dimension separator.
`test_conv_description.cpp` expected strings updated in lockstep so the
snapshot test stays green. This is the only commit in the series with
observable runtime impact.

2. `738fdb0d81c` — **`[ck] Strip non-ASCII bytes from C++ sources for
hipRTC compatibility`**
Mechanical text cleanup across 53 files. Replacements happen in comments
or in `std::cout` strings that are not asserted on by any test. None of
the 174 `.inc` files in the tree required edits, but they were in the
scan's predicate so the enforcement stage's predicate is a superset of
what was scanned. Full replacement table in the commit message.

3. `1d7cd8ba235` — **`[ck] Enforce ASCII-only C/C++ sources for hipRTC
compatibility`**
- New `projects/composablekernel/script/check_ascii_only.sh` (modeled on
`check_copyright_year.sh`).
- New entry in `projects/composablekernel/.pre-commit-config.yaml` under
the local-hooks block (`types_or: [c++, inc]`).
- New `ASCII Only Check` parallel stage in
`projects/composablekernel/Jenkinsfile`'s `Static checks` block,
mirroring the existing `Clang Format` stage but with `*.inc` added to
the find predicate. Always-on, no `RUN_CPPCHECK` gate.

The tree is buildable at every commit boundary. Commit 1 leaves 50 known
violations; commit 2 leaves 0; commit 3 wires the gate.

## Demo

Script output on a synthesized violation:

```
$ printf '// em-dash test \xe2\x80\x94 here\n' > /tmp/bad.cpp
$ projects/composablekernel/script/check_ascii_only.sh /tmp/bad.cpp
ERROR: /tmp/bad.cpp contains non-ASCII bytes:
1:// em-dash test — here
  Fix: replace with ASCII (em-dash -> --, smart quotes -> ", arrows -> ->, etc.)
$ echo $?
1
```

Full repo scan after the cleanup commits (note the `-name '*.inc'`
clause):

```
$ cd projects/composablekernel && find . -type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \
    -o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \
    -not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \
  | xargs -0 -P 8 -n 64 script/check_ascii_only.sh
$ echo $?
0
```

## Test plan

- [ ] Jenkins PR build: confirm new `Static checks -> ASCII Only Check`
stage runs green over the full predicate (incl. `*.inc`) and existing
`Clang Format` stage is unaffected.
- [ ] `test_conv_description` passes against the ASCII tree-formatter
output (touched in commit 1).
- [ ] Local: `pre-commit run ascii-only-checker --all-files` runs
cleanly after installing CK pre-commit hooks via
`script/install_precommit.sh`.
- [ ] Manually inject a non-ASCII byte in any `.cpp/.hpp/.inc` file,
push: confirm Jenkins fails the new stage with a clear error.
- [ ] Spot-check a representative subset of touched files under hipRTC
compilation to confirm no remaining hipRTC-blocking content (optional,
since the static byte check is a sufficient condition for hipRTC
preprocessor acceptance on this dimension).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-06-04 15:00:17 +00:00
chris-tsiaousis-hpc
db05d61136 [rocm-libraries] ROCm/rocm-libraries#6212 (commit ccee58d)
=?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?=
 =?UTF-8?q?More=20accurate=20tests=20for=20MmaPipelines=20(#6212)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

This PR solves several issues:

#### More accurate tests for MmaPipelines

The current tests for the MmaPipelines (test_amdgcn_sparse_mma,
test_amdgcn_wavewise_mma) use explicit input fragment vectors filled
with 1s, and only check the output of a single lane. We should have
tests that actually use the MmaPipelines with non-trivial input matrices
and verify the complete output.
Some other aspects of the current MmaPipelines tests that I noticed and
deserve some attention:

1. There is sometimes iteration over K outside of the pipeline, which is
then included in WaveTileK or FragK, which is not correct. We should
remove it, move K iteration inside of the pipeline, or be more clear
about this outer-K loop size and how it propagates downwards.
2. There is very tight coupling between the kernel, gtest code, and
test_pipeline helper, requiring a lot of information and functions to be
passed back and forth.
3. The test_pipeline helper is doing a bunch of register-related logic
on the host (related to point 1)
4. Without this register logic the only thing it does is check the
device, call the kernel, and check the output, but with a lot of
boilerplate.

#### Test helper for detecting target arch at HOST runtime

There is a really apparent issue we faced while writing tests:

Scenario:
1. Compile a test that supports both gfx950 and gfx1201 for gfx950
2. Run the test on a server that only has gfx1201 GPU

Actual:
Segmentation fault

Expected:
The test can correctly detect from HOST runtime that the DEVICE
target_id was different and skips the test.

Notes:

The only way of detecting the COMPILER_TARGET_ID in the existing "arch"
framework is launching a kernel and calling `get_compiler_target()` (so,
from a DEVICE code). This will create a segmentation fault if the
current arch differs from the target arch. To cope with this issue, we
propose to export the compiler target(s) (note they can be many) through
`projects/composablekernel/test/ck_tile/core/arch/CMakeLists.txt` and
define a test helper to deal with such cases.

#### Add composition support to Transforms

We have a small number of Transforms which act on MmaOp input and output
data, before and after the MmaOp call respectively. These are currently
implemented to work on an MmaTile level, but in theory they are also
supposed to work at a WaveTile level, i.e. after composition of multiple
MmaTiles to create larger effective MNK dimensions. Currently the
composed MmaTiles look like 2D C-style arrays of the individual MmaTile
level register vectors (see WaveWiseMmaPipeline). The transforms should
be able to take these and perform the proper transforms to the whole
WaveTile at once. This might allow for better performing
transformations.

Note: This PR handles the SparseTransform case and if we don't end up
doing scale as a transformation, there isn't really much left to do. If
we end up having only the sparse transform as a non-trivial transform,
then we could also consider removing the Transform framework.
2026-06-03 14:35:18 +00:00
Anton Gorenko
7ecbf82708 [rocm-libraries] ROCm/rocm-libraries#7500 (commit f5cd4fd)
[CK_TILE][FMHA] Optimize long-context decoding on gfx11/12
 (#7500)

## Motivation

Relevant issue: ROCM-22065

FMHA has less-than-optimal performance of long-context decoding (i.e.
when seqlen_q = 1) on gfx11/12.
This PR optimizes the splitkv pipeline and configs for such scenarios.

## Technical Details

Optimizations applied in this PR:
1. use tiles with smaller M0 (16 vs 64), these tiles are used when
seqlen_q <= 16
2. adapt qr_nwarp_sshuffle pipeline for gfx11, it allows to use more
warps even for M0 = 16 (the qr pipeline parallelizes work between warps
in M dim so with M0 = 16 it allows to use only 1 warp)
3. enable kMergeNumHeadGroupsSeqLenQ (an optimization that merges one
group of heads in GQA) for all hdim values, not only 128
4. increase the number of splits (multiply by the number of head groups)
if (3) is used
5. increase the number of splits for RDNAs (`multiProcessorCount` is the
number of WGPs on RDNAs, not CUs, so it should be doubled to have
meaning similar to CDNAs)

Performance on gfx1151:

| Case | develop (GB/s) | This PR (GB/s) |
|:-------|-------:|-------:|
| [fp16\|group\|bshd] b:1, h:32/32, s:1/45056, d:64/64 | 127.58 | 183.11
|
| [fp16\|group\|bhsd] b:1, h:32/32, s:1/45056, d:64/64 | 153.64 | 215.02
|
| [fp16\|group\|bshd] b:1, h:16/8, s:1/77184, d:128/128 | 120.51 |
225.76 |
| [fp16\|group\|bhsd] b:1, h:16/8, s:1/77184, d:128/128 | 130.62 |
223.84 |
| [fp16\|group\|bshd] b:1, h:32/32, s:1/9600, d:128/128 | 82.65 | 138.44
|
| [fp16\|group\|bhsd] b:1, h:32/32, s:1/9600, d:128/128 | 105.75 |
220.45 |
| [fp16\|group\|bshd] b:1, h:8/1, s:1/401024, d:256/256 | 16.27 | 187.89
|
| [fp16\|group\|bhsd] b:1, h:8/1, s:1/401024, d:256/256 | 16.28 | 188.19
|

## Test Plan

An additional test case is added to the exiting test. It uses seqlen_q =
1, GQA, no mask to trigger the changes
```
ninja test_ck_tile_fmha_fwd_fp16 && bin/test_ck_tile_fmha_fwd_fp16 --gtest_filter="*SplitKV*
ninja test_ck_tile_fmha_fwd_bf16 && bin/test_ck_tile_fmha_fwd_bf16 --gtest_filter="*SplitKV*
```

Manual testing can be done with these commands:
```
bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=64  -s=1 -s_k=$((352 * 128))  -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1
bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=16 -h_k=8  -d=128 -s=1 -s_k=$((603 * 128))  -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1
bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=128 -s=1 -s_k=$((75 * 128))   -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1
bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=8  -h_k=1  -d=256 -s=1 -s_k=$((3133 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1
```

## Test Result

All the tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-03 06:16:10 +00:00
Hosang Yoon
e7e8801dc3 [rocm-libraries] ROCm/rocm-libraries#7586 (commit c18f2c7)
[CK_TILE] Use gfx11 float buffer atomics in FMHA Bwd

## Motivation

FlashAttention CK backward on gfx11 can hit out-of-bounds/tail writes in
the dQ accumulator atomic-add path when sequence rows are padded at the
tile level but not marked invalid in the DQDKDV main tensor view.

With the generic global atomic fallback, an incorrectly-valid tail
element can issue an actual pointer-based `atomicAdd`. With the buffer
atomic path, the write is issued through a buffer resource with bounds
information and follows the same backend already used by gfx9/gfx12.

This fixes the gfx11 FMHA BWD failure without changing the gfx11 default
for unrelated CK Tile kernels.

## Technical Details

This PR enables the existing CK Tile AMD buffer float atomic-add path
only for generated FMHA BWD gfx11 translation units.

gfx11 normally uses the generic global atomic fallback for
floating-point `buffer_view::atomic_add`. That fallback performs the
atomic through a raw computed pointer and depends on the software
validity predicate to avoid invalid elements. In FMHA BWD dQ
accumulation, padded tail rows can reach this path, so using the buffer
atomic backend is safer: it uses a buffer resource with base pointer,
bounds information, and an element offset, matching the backend already
used by gfx9/gfx12.

Enabling `CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT` globally for gfx11 is
too broad and can break unrelated gfx11 CK builds such as GEMM. Instead,
`config.hpp` now preserves an explicitly pre-defined
`CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT`, while keeping the existing
default disabled for gfx11.

## Test Plan

Validated the change with the FlashAttention CK full test suite with
backward pass enabled on gfx11.
pytest -q -s tests/test_flash_attn_ck.py

## Test Result

FlashAttention CK gfx11 test result:
260680 passed, 152076 skipped

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-05-30 00:10:26 +00:00
Andriy Roshchenko
d5c9215064 [rocm-libraries] ROCm/rocm-libraries#7359 (commit dd62f9f)
[CK_TILE][GFX1250] Enable MX GEMM FLATMM with ASYNC

## Motivation

Enables MX GEMM FLATMM pipeline on gfx1250. The pipeline uses an async
load instruction for tensor A, which complements the existing MX GEMM
FLATMM pipeline with TDM load. At this time, only FLATMM MX pipelines
are enabled on gfx1250.

## Technical Details

The existing gfx950 implementation was extended to support gfx1250
architecture. All three MX FP data types are supported across the two
ASICs.
It should be noted that while the TDM pipeline uses an emulated
32x32x128 warp-tile instruction, the present submission relies on the
built-in 16x16x128 instruction, called 4 times per warp.

## Test Plan

Existing `test/ck_tile/flatmm` tests were extended to cover new gfx1250
functionality.

To help facilitate the testing in development,
`example/ck_tile/18_flatmm/script/smoke_test_mx.sh` script was
introduced to verify various combinations of supported data types and
pipeline versions.

## Test Result

The present submission is expected to work on both gfx950 and gfx1250
hardware for all reasonable sizes and all MX FP8/FP6/FP4 data types.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
- [x] Relies on #6978 and should only be merged after the changes are
merged to the `develop`.
2026-05-29 17:02:45 +00:00
Illia Silin
8bd8094012 [rocm-libraries] ROCm/rocm-libraries#7833 (commit 8a444cd)
[CK] Replace deprecated load_module function in python
 (#7833)

## Motivation

Recent pytorch builds with python 3.15 failed in CK due to deprecation
of load_module function. This should fix the issue.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-29 15:29:58 +00:00
ltqin
bf07a0150e [rocm-libraries] ROCm/rocm-libraries#7723 (commit 4ed6c51)
[CK Tile] Enable LSE output for fp8bf16 V3 FMHA kernels
 (#7723)

###  Motivation
The V3 pipeline (qr_async_trload_v3) for fp8bf16 FMHA kernels did not
support LSE (Log-Sum-Exp) output. This PR enables LSE output support for
fp8bf16 V3 FMHA kernels, allowing users to retrieve attention statistics
alongside attention outputs.
### Technical Details
    - StandardAttention: lse = softmax_scale * m + log(l)
- LogitsSoftCap: lse = (m / log2(e)) + log(l)

### Test Plan
Run FMHA forward example with fp8bf16 precision and LSE output enabled:
- Test 1: Basic LSE functionality
./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128
-prec=fp8bf16 -init=3 -qscale=1 -lse=1
- Test 2: LSE with LogitsSoftCap (CMakeList should remove Logits filter)
./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128
-prec=fp8bf16 -init=3 -qscale=1 -lse=1 -logits_soft_cap=30.0
2026-05-28 15:58:54 +00:00
Zoltán Lakatos
58e2ab1fc7 [rocm-libraries] ROCm/rocm-libraries#6761 (commit d19f6f1)
[CK] Large tensor gemm workaround (#6761)

## Motivation

Customer qeruested large tensor gemm support for 8bit and 4bit data
types. Currently CK triggers “This GEMM not supported” error. The root
cause appears to be the 2 GB limit on the input/output matrix, triggered
by buffer offset constraints when testing a larger shape such as M =
699,904 (which is an exact multiple of MPerBlock = 256).

## Technical Details

Quick workaround to have support ASAP. Split the tensors into inputs /
outputs smaller than 2GB limit. Iterate on host and call all subproblems
without device code change.
Support is restricted to rowise layout in A, Ds and E

All changes were implemented in DeviceGemm structures to avoid secondory
affect on grouped convolutions.

Got lots of AI generated comments. Addressed the ones that seemed
relevant on the functionality.

## Test Plan

Within CK the following examples can be used with modified input sizes:
example_gemm_multiply_multiply_xdl_fp8
example_gemm_mx_fp4
Tested with Aiter tuning on provided shapes.

## Test Result

All gemms run and provide correct results.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Márton Bidlek <marton.bidlek@streamhpc.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2026-05-27 18:55:15 +00:00
Illia Silin
c24e528481 [rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76)
[CK] suppress compiler warnings while building pytorch. (#7760)

## Motivation

Recently added compiler flags that are required to suppress false
warnings by latest staging compiler are not recognized by older compiler
versions and are triggering an avalanche of warnings. Previous attempt
to suppress them by using -Wno-unknown-warning-option flag didn't help,
because that flag wasn't recognized either and just added more warnings.
I've verified that current approach by checking the clang version
actually works as intended and makes the warnings go away.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-27 06:56:58 -07:00
JH-Leon-KIM-AMD
00e1d82ae7 [rocm-libraries] ROCm/rocm-libraries#7732 (commit b0e29d9)
[CK] Fix grouped conv bwd data stride>1 silent miscompute (ALMIOPEN-1959) (#7732)

## Motivation

Fix silent miscompute in the grouped convolution backward-data kernel
(`DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1`) when stride >
dilation (ALMIOPEN-1959). PR #6208 introduced a flat-descriptor fast
path that dropped all but the first sub-GEMM, producing zeroed slices of
`dx` on
the (G=1, stride>1, 2D, NumDTensor=0) intersection. Restore correctness
without giving up the perf gains PR #6208 delivered on stride=1 shapes.

## Technical Details

- Tighten the flat-descriptor fast-path gate to require
`arg.gemms_count_ == 1` (i.e. a single sub-GEMM per dispatch — its
original purpose). For stride > 1, the implicit GEMM is split into
`gemms_count_` sub-GEMMs whose output cells tile `dx` disjointly;
routing them through the flat path required dropping all but the first,
which was the source of the bug.
- Stride > 1 now falls through to the existing grouped CShuffle path,
which packs all sub-GEMMs into one descriptor array and walks them
on-device in a single kernel launch. This is the pre-PR-6208 production
path; correctness is established and per-dispatch launch count is
minimised.
- Add regression coverage for the (G=1, stride>1, 2D, NumDTensor=0)
intersection in
`test/grouped_convnd_bwd_data/test_grouped_convnd_bwd_data.cpp` with
`gemms_count` ∈ {4, 9, 36}. Pre-existing cases did not hit this
intersection (all stride>1 cases used G=2; all G=1 cases used stride=1),
which is why PR #6208's regression slipped past CI.

## Test Plan

- `ctest -L SMOKE_TEST -R 'grouped_convnd_bwd_data'` on gfx942 (smoke
tier — runs on every PR via `smart_build_and_test.sh`).
- End-to-end verify (`verify=1`) via
`example_grouped_conv_bwd_data_xdl_fp16` on stride 1/2/3/6 shapes
including the original ALMIOPEN-1959 case and a cross-bucket
(`gemms_count=36`) case spanning two `MaxGroupedGemmGroupsNum=32`
buckets.
- ckProfiler A/B sweep on MI300X (gfx942) toggling the flat-path gate
via an environment variable: full kernel-family enumeration, winning
kernel + its avg_time reported under each gate. 33/41 shapes completed
before the sweep was stopped; the remaining 8 were the largest
i2v/synthetic shapes where ckProfiler exceeded its 300s per-shape
enumeration budget (not relevant to the verdict).

## Test Result

### Correctness

| Test | Result |
|---|:---:|
| `test_grouped_convnd_bwd_data` (12 type parameterizations × Test2D,
includes 3 new regression shapes) | **12/12 PASSED** in 14.18 s |
| `test_grouped_convnd_bwd_data_interface` (API checks) | **PASSED** in
0.28 s |
| ALMIOPEN-1959 stride=2 (`verify=1`) | **PASSED** |
| stride=1 K3 (`verify=1`) | **PASSED** |
| stride=3 K3 `gemms_count=9` (`verify=1`) | **PASSED** |
| stride=6 K6 `gemms_count=36` cross-bucket (`verify=1`) | **PASSED** |

### Performance (ckProfiler A/B on gfx942 / MI300X)

Comparing the **post-fix gate** (flat path only when `gemms_count_==1`,
column "B") vs the **inner-loop variant** that keeps the flat path on
stride>1 (column "A") across 25 stride>1 shapes where production picks
a `_v1` instance (so the gate actually fires):

| Stride | Shapes | A wins | Tie | B wins | Notes |
|:------:|:------:|:------:|:---:|:------:|---|
| 1 (sanity, gate moot) | 3 | 0 | 3 | 0 | gate doesn't differentiate — A
== B as expected |
| > 1 (gate fires) | 25 | **0** | 11 | **14** | B wins +6% to +32%; A
never wins |

Highlights from the firing-gate cases:

| Shape (G=1, stride=2 unless noted) | A ms | B ms | B vs A |
|---|---:|---:|---:|
| ALMIOPEN-1959 (N=16, K=256, C=128, 5×5, 40×175) | 0.183 | 0.171 | **B
+6%** |
| Retinanet-L61 (N=32, K=C=256, 3×3, 25×25) | 0.054 | 0.045 | **B +17%**
|
| i2v-010 (N=1, K=C=384, 3×3, 277×209) | 0.174 | 0.125 | **B +28%** |
| Synthetic 50×50 K3 N=32 K=C=256 | 0.131 | 0.088 | **B +32%** |

Why B wins everywhere the gate fires: for `gemms_count = N`, the flat
path needs N kernel launches (one per sub-GEMM), while the grouped path
loops over the same N sub-GEMMs on-device in 1 launch. The (N−1) ×
launch-tax is a structural disadvantage A can't recover from.

### Diff

| File | Lines |
|---|---:|
|
`include/.../device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v1.hpp`
| +14 / −8 (one extra condition + expanded dispatch comment) |
| `test/.../test_grouped_convnd_bwd_data.cpp` | +9 / −0 (3 new shapes) |

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-27 09:59:14 +03:00
Aviral Goel
0df3523ef1 [rocm-libraries] ROCm/rocm-libraries#6807 (commit ddda8ac)
[CK_TILE] Add save_matrix_txt() and extract HostTensor I/O to free functions (#6807)

## Summary
- Extract `loadtxt`, `savetxt`, and `save_matrix_txt` from `HostTensor`
member functions into standalone free functions in `host_tensor_io.hpp`
(Single Responsibility Principle)
- Add `save_matrix_txt()` for writing 2D tensors to space-separated text
files with configurable output limit (default 256x256, pass 0 to dump
all)
- Supports float, int, and int8_t output formats via a `dtype` parameter
- Validate dtype early and throw on unsupported values in all three
functions
- Update callers in `15_fused_moe/main.cpp` to use free function syntax
2026-05-26 11:07:18 -04:00
Anton Gorenko
66d6714376 [rocm-libraries] ROCm/rocm-libraries#5388 (commit 45583bd)
[CK_TILE][FMHA] Improve precision of mxfp4 FMHA with fp6 for matrix P (#5388)

## Motivation

Improve precision of mxfp4 without performance penalties.

## Technical Details

Since performance of scale MFMAs is the same when neither A nor B is
fp8/bf8, it is possible to use fp6 x fp4 instead of fp4 x fp4 for the
second GEMM, while types of Q, K, V stay the same.
This allows to improve overall precision significantly because fp6 has
32 non-negative values used for P quantization compared to just 8 values
for fp4.

It was found that there is a compiler bug with
`__builtin_amdgcn_cvt_scalef32_2xpk16_fp6_f32` (described in
LCOMPILER-561) but a workaround seems to fix all failing instances.

## Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

## Test Result

The tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-26 06:55:17 -07:00
Yi DING
6a9c03f692 [rocm-libraries] ROCm/rocm-libraries#7450 (commit 402dbad)
[CK_TILE] Use Persistent Scheduling for FMHA BWD Group Deterministic (#7450)

## Motivation

FMHA BWD group-mode deterministic currently uses a non-persistent
scheduler: each `(batch, head, K-row)` work-item is launched as its own
block, with no work-stealing across CUs. On uneven workloads (varlen,
GQA, many heads with
few K-rows) this leaves CUs idle and forces a larger dq_acc workspace
than necessary.

This PR ports the persistent + deterministic scheduling already used in
batch mode to group mode: a fixed-grid kernel that pre-computes per-CU
work ranges on the host and uses sparse dq_acc slot indexing so multiple
K-rows handled
by the same CU share one accumulator slot via intra-CU atomic adds.

Stacked on #7331; merge that first.

## Technical Details

Single file changed: `ops/fmha/kernel/fmha_bwd_kernel.hpp`.

A new `kUsePersistent` path is added to the group-mode deterministic
kernel, mirroring the batch-mode persistent scheduler. The host
pre-computes a fixed per-CU partition of the total `(batch, head,
K-row)` work and packs it into
`cu_states[]` so the GPU consumes it in a single launch. Host
preparation happens in four steps:

1. Build per-batch `seqstart` prefix sums.
2. Fill per-batch `(sq_w, nc)` with a placeholder `nsplits` (bumped in
step 3).
3. Two-pointer scan over CUs to fill `cu_states[c]` (`isplit`,
`head_start`, `c_start`, `w_lo`, `w_hi`), accumulating `nsplits[b]` as
`max(cs->isplit + 1)`.
4. Compute compact per-batch dq_acc offsets from the finalized
`nsplits`.

`isplit` is the sparse dq_acc slot index — one CU's multi-K-row writes
share slot `ceil(wc_start / denom)`, enabling intra-CU atomic
accumulation instead of one slot per K-row.

`denom = max(sq_w, target_w)`, splitting two regimes:

- `target_w >= sq_w` (large work): `denom = target_w`, intra-CU atomic
optimization engaged.
- `target_w < sq_w` (sub-K-row sharding, multiple CUs sharing one
K-row): `denom = sq_w` collapses to per-K-row indexing (`= c_start`),
keeping `isplit ∈ [0, nc-1]` and matching the `nsplits_max =
ceil(s_k/kN0) = nc` upper bound that #7331's
`GetWorkspaceDeviceSizeUpperBound` assumes for group+det.

`isplit` is additionally clamped to `nc-1` to absorb empty CUs
(rounded-up `wc_start` past the last K-row); they don't write dq_acc on
GPU so the slot value is harmless.

`nsplits[b]` is accumulated dynamically in step 3 rather than via a
closed form so it tightly matches the actual sparse slots used; step 4
(offsets) follows step 3 since offsets now depend on the dynamic
`nsplits`.

Group mode also allows batches with `seqlen_q == 0`. The persistent
scheduler skips them on the dQ path (no work) but dK/dV are still
zero-filled.

## Test Plan

Built `tile_example_fmha_bwd` with receipt 5 (fp16, no-bias, no-dropout,
`dpad == dvpad`, group + batch) on gfx950 (MI355X).

- 8-case smoke (shapes that exercise the sub-K-row regime).
- 44-case sweep covering: mask 0/1/2, GQA, var seqlen, `d != d_v`,
extreme
  small seqlen / `nc=1`, CU >> work, huge batch, batch-mode regression.
- 12-case perf comparison vs the non-persistent baseline (warmup=10,
  repeat=50).

## Test Result

- All 8 + 44 cases `valid:y`.
- Perf: ±5% noise, average -0.4% across the 12 cases — neutral.
- Batch-mode deterministic / non-deterministic regression unchanged.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-26 10:01:54 +08:00
Michal Kulikowski
8de4cb72fb [rocm-libraries] direct push (commit 49b73ad)
[CK][CK_TILE] POC for Instruction Cache prefetch.

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2026-05-25 11:26:26 +02:00
JP-Fernando
74bc86240b [rocm-libraries] ROCm/rocm-libraries#5647 (commit 490437a)
[CK Tile] Add gemm universal preshuffle to MX GEMM  (#5647)

## Motivation

Add gemm universal preshuffle support to existing MX GEMM pipeline.

The straightforward way to do this is to port the `mx_flatmm` pipeline
to the existing `gemm_mx` framework.

## Technical Details

The `mx_flatmm` pipeline was not deleted, to allow for
back-compatibility.

## Test Plan

Add `preshuffle` option to example: `tile_example_mx_gemm`.

Add new configurations with enabled preshuffle to the existing
`test/ck_tile/gemm_mx` tests.

## Test Result

Example and tests were successful on `gf950` architecture in the `Alola`
cluster.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
2026-05-22 16:07:53 +02:00
Bartłomiej Kocot
ebb97044f4 [rocm-libraries] ROCm/rocm-libraries#7664 (commit de5d6b1)
Revert "[CK] Enable grouped conv bwd data to match non-grouped perf" (#7664)

## Motivation

Incorrect results has been introduced for some conv bwd cases.

## Technical Details

This reverts commit 33424f65346d6330d0fd94b5a4e6f843f24e52c3.

## Test Plan

CI

## Test Result

Pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

ALMIOPEN-1959
2026-05-22 12:28:49 +00:00
Illia Silin
e02c566795 [rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24)
[CK] upgrade CI to rocm7.13 as default compiler (#7612)

## Motivation

Upgrade the default docker and compiler version in CI to rocm7.13.
In order to pass all the checks I had to also clean up a lot of
non-ascii characters in the source code comments and modify a couple of
tests that were affected by a new compiler logic.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
2026-05-22 02:43:50 +00:00
kensclin
fc2862d712 [rocm-libraries] ROCm/rocm-libraries#6846 (commit 377def4)
[CK_TILE] Add fmha forward hdim 256 support (#6846)

## Motivation

Enable Composable Kernel FMHA forward kernel for **hdim=256 BF16** on
AMD gfx950 (MI350X). Prior to this change the (256, 256) head-dim
configuration either failed to compile, was filtered out by the
compatibility rules, or produced incorrect kernel output due to an LDS
layout accounting bug.

## Technical Details

  Four files changed, all to enable hdim=256 BF16 on gfx950.

- **`fmha_fwd.py`** — Allow `(256, 256)` in gfx950 compatibility rule;
set `(256,256)` BF16 tile to `M0=128, N0=64` (the LDS-feasible shape on
gfx950); emit minimal valid instance set for d=256 to bound compile
time.

- **`fmha_fwd_kernel.hpp`** — Gate Prefill launch path off for d=256
(`PrefillCase = kM0 > 64 && kQKHeaddim < 256`); the double-buffer
Prefill variant overflows the 160 KB LDS budget.

- **`trload_policy.hpp`** — **Critical correctness fix**: the LDS layout
accounting in `GetSmemSize` was wrong (`max(Q, K+S+V)` instead of
`max(Q, K) + V + S`), under-allocating LDS and silently corrupting d=256
output (~2% wrong values).

- **`trload.hpp`** — Thread `LoadOnce=true` through all d=256 K-LDS
descriptors so the compiler picks the matching XOR swizzle period;
recompute the S-tile LDS offset to match the corrected `GetSmemSize`
formula.

## Test Plan

Built and ran `tile_example_fmha_fwd` on gfx950 (MI350X) with the
canonical d=256 BF16 configurations:

  ```bash
  cd build && ninja tile_example_fmha_fwd
./bin/tile_example_fmha_fwd -prec=bf16 -d=256 -d_v=256 -b=1 -h=32 -h_k=2
-s=1024 -s_k=1024 -bias=n -mask=t -lse=0 -p_drop=0 -warmup=3 -repeat=10
-kname=1 -v=1
./bin/tile_example_fmha_fwd -prec=bf16 -d=256 -d_v=256 -b=8 -h=32 -h_k=2
-s=16384 -s_k=16384 -bias=n -mask=t -lse=0 -p_drop=0 -warmup=3
-repeat=10 -kname=1 -v=1
  ```

## Test Result

  ```bash
-b=1 -s=1024
[bf16|batch|bhsd] b:1, h:32/2, s:1024/1024, d:256/256, scale_s:0.0625,
bias:n, p_drop:0, lse:0, qscale:n, mask:t(-1:0), v:r,
fmha_fwd_d256_bf16_batch_b128x64x32x256x32x256_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_trload_vr_psddv_nlogits_nbias_mc_nlse_ndropout_nskip_nqscale_ntrload_nsink,
0.058 ms, 298.42 TFlops, 618.68 GB/s, valid:y

-b=4 -s=16384
[bf16|batch|bhsd] b:8, h:32/2, s:16384/16384, d:256/256, scale_s:0.0625,
bias:n, p_drop:0, lse:0, qscale:n, mask:t(-1:0), v:r,
fmha_fwd_d256_bf16_batch_b128x64x32x256x32x256_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_trload_vr_psddv_nlogits_nbias_mc_nlse_ndropout_nskip_nqscale_ntrload_nsink,
42.797 ms, 822.18 TFlops, 106.63 GB/s, valid:y
  ```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: poyenc <1132573+poyenc@users.noreply.github.com>
2026-05-22 01:57:41 +08:00
JP-Fernando
e7798e9560 [rocm-libraries] ROCm/rocm-libraries#7112 (commit a6e5eac)
Add asynchronous XOR shuffle support to the Async GEMM pipeline and the MX GEMM pipeline (#7112)

## Motivation

The goal of this work is to apply XOR shuffle (swizzle) to the current
`comp_async` GEMM pipeline and the `gemm_mx` pipeline.
XOR swizzling has been helpful to avoid LDS bank conflicts, as data are
redistributed across LDS banks, such that simultaneous threads accessing
different rows land on different LDS banks.

## Technical Details

A similar approach to the work in the existing eight-waves pipeline was
followed.
Currently, XOR swizzle support is available for FP8 and BF8 types.
FP4 support is also available for MX GEMM.
Should the types not match, or should the async vector width be of an
unsupported size, then the pipeline falls through to the previously
existing ('unswizzled') path.

## Test Plan

Execute `test_ck_tile_gemm_pipeline_comp_async` for the Async GEMM
pipeline.
Execute `test_ck_tile_mx_gemm_fp8` and `test_ck_tile_mx_gemm_fp4` for
the MX GEMM pipeline.

## Test Result

The tests passed successfully in the `Alola` cluster with MI350
hardware.

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-05-21 09:36:41 +02:00