Commit Graph

1622 Commits

Author SHA1 Message Date
Aviral Goel
15c904b460 [rocm-libraries] ROCm/rocm-libraries#7724 (commit 4cb149a)
ck_tile: add FillUniformScaleDistribution and fix MX GEMM
 scale init (#7724)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

### Problem
MX GEMM pipeline tests were passing vacuously: scale bytes were drawn
from a fixed range (40–60) which, for e8m0, maps to scales ≈ 10⁻²⁷ — far
below FP16 min denorm. Both GPU and CPU produced all-zero outputs, so
numerical checks passed without exercising the GEMM.

### Changes

**`include/ck_tile/host/fill.hpp`** — new
`FillUniformScaleDistribution<ScaleType>` functor
- Accepts human-readable float bounds and maps them to the raw byte
range of any ExMy scale type (e8m0, e4m3, e5m3) by re-centering the IEEE
754 exponent into the type's bias space
- Sampling is uniform over raw bytes → uniform over representable values
- Fixes left-shift UB: uses multiplication instead of `<< mant_bits` to
avoid shifting negative signed integers (C++17 UB)
- Adds `assert(min_r <= max_r)` to catch inverted-range UB when both
bounds exceed the type's representable range
- Provides default member values (0.125f, 2.0f) and `std::optional` seed
consistent with sibling fillers
- `/** */` Doxygen style with `@note` on snapping asymmetry

**`test/ck_tile/gemm_mx/test_mx_gemm_pipeline_util.hpp`** — fix scale
initialization
- Replace manual byte-range distribution with
`FillUniformScaleDistribution<>{0.125f, 2.0f}`
- Use distinct seeds for scale_a (11941) and scale_b (11943) to avoid
correlated scale tensors that were causing 60 test failures for
fp4+e5m3/e4m3 combinations

**`test/ck_tile/utility/test_fill.cpp`** — new unit tests for
`FillUniformScaleDistribution`
- 16 typed tests across e8m0, e4m3, e5m3: validity, range,
reproducibility, coverage, snapping, stress, nullopt seed, and range
overload
- Test helper `expected_raw_range` mirrors implementation clamping
exactly
2026-05-29 18:45:13 +00:00
Andriy Roshchenko
d5c9215064 [rocm-libraries] ROCm/rocm-libraries#7359 (commit dd62f9f)
[CK_TILE][GFX1250] Enable MX GEMM FLATMM with ASYNC

## Motivation

Enables MX GEMM FLATMM pipeline on gfx1250. The pipeline uses an async
load instruction for tensor A, which complements the existing MX GEMM
FLATMM pipeline with TDM load. At this time, only FLATMM MX pipelines
are enabled on gfx1250.

## Technical Details

The existing gfx950 implementation was extended to support gfx1250
architecture. All three MX FP data types are supported across the two
ASICs.
It should be noted that while the TDM pipeline uses an emulated
32x32x128 warp-tile instruction, the present submission relies on the
built-in 16x16x128 instruction, called 4 times per warp.

## Test Plan

Existing `test/ck_tile/flatmm` tests were extended to cover new gfx1250
functionality.

To help facilitate the testing in development,
`example/ck_tile/18_flatmm/script/smoke_test_mx.sh` script was
introduced to verify various combinations of supported data types and
pipeline versions.

## Test Result

The present submission is expected to work on both gfx950 and gfx1250
hardware for all reasonable sizes and all MX FP8/FP6/FP4 data types.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
- [x] Relies on #6978 and should only be merged after the changes are
merged to the `develop`.
2026-05-29 17:02:45 +00:00
Bartłomiej Kocot
5d912538d3 [rocm-libraries] ROCm/rocm-libraries#7847 (commit b995ef2)
[CK] Remove IsPackedTensor function

## Motivation

Fix codegen hipRTC

## Technical Details

Remove not needed function. Since MakeArgument supports long_index_t
strides.

## Test Plan

Codegen tests.

## Test Result

Passed.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-29 14:00:06 +00:00
Ville Pietilä
78d657c4f7 [rocm-libraries] ROCm/rocm-libraries#7284 (commit e7d25b2)
[CK_TILE] Integrate CK Tile Dispatcher code generation into
 CK Tile Profiler (#7284)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

CK Tile is going to be delivered to hipDNN via CK Dispatcher. Currently
the CK Tile Profiler using CK Builder for generating the profiled
instances from the configuration files that identify the instances that
old CK exposes. We need to replace this instance generation with the CK
Tile Dispatcher codegen.

## Technical Details
The old CK Profiler config files are converted to JSON files that the CK
Tile Dispatcher can digest. The conversion script for configurations is
stored to source control in case we need to update the JSON
configurations later. The dispatcher generates instance libraries per
conv direction (fwd, bwd data, and bwd weight) that are linked to the CK
Profiler executable. I also implemented codegne for the stream-K and
depthwise conv instances. The proposed solution replaces the CK Builder
codegen with the CK Tile Dispatcher codegen.

There are two new methods that are exposed via the dispatcher backend

- `is_supported` - required to enabled the profiler workflow where we
check the applicability of the kernel instance before running it.
- `get_instance_string` - this mainly for verification. This provide the
CK Builder instance string for verifying that the old CK Builder based
profiler and the new CK Tile Dispatcher based profiler have the same
instances.

The rules that limit the generated instances are now collected to a
single location under the dispacther. The CK Builder codegen uses these,
which ensures that the two codegen pipelines are in sync. The next step
(different PR) is to remove the CK Builder codegen pipeline altogether.

## Test Plan

Verified that the old CK Builder based profiler and the new CK Tile
Dispatcher based profiler have the same instances, that is, the
Dispatcher based codgen can generate the same instances as the old CK
Builder.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-28 21:03:37 +00:00
ltqin
bf07a0150e [rocm-libraries] ROCm/rocm-libraries#7723 (commit 4ed6c51)
[CK Tile] Enable LSE output for fp8bf16 V3 FMHA kernels
 (#7723)

###  Motivation
The V3 pipeline (qr_async_trload_v3) for fp8bf16 FMHA kernels did not
support LSE (Log-Sum-Exp) output. This PR enables LSE output support for
fp8bf16 V3 FMHA kernels, allowing users to retrieve attention statistics
alongside attention outputs.
### Technical Details
    - StandardAttention: lse = softmax_scale * m + log(l)
- LogitsSoftCap: lse = (m / log2(e)) + log(l)

### Test Plan
Run FMHA forward example with fp8bf16 precision and LSE output enabled:
- Test 1: Basic LSE functionality
./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128
-prec=fp8bf16 -init=3 -qscale=1 -lse=1
- Test 2: LSE with LogitsSoftCap (CMakeList should remove Logits filter)
./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128
-prec=fp8bf16 -init=3 -qscale=1 -lse=1 -logits_soft_cap=30.0
2026-05-28 15:58:54 +00:00
Zoltán Lakatos
58e2ab1fc7 [rocm-libraries] ROCm/rocm-libraries#6761 (commit d19f6f1)
[CK] Large tensor gemm workaround (#6761)

## Motivation

Customer qeruested large tensor gemm support for 8bit and 4bit data
types. Currently CK triggers “This GEMM not supported” error. The root
cause appears to be the 2 GB limit on the input/output matrix, triggered
by buffer offset constraints when testing a larger shape such as M =
699,904 (which is an exact multiple of MPerBlock = 256).

## Technical Details

Quick workaround to have support ASAP. Split the tensors into inputs /
outputs smaller than 2GB limit. Iterate on host and call all subproblems
without device code change.
Support is restricted to rowise layout in A, Ds and E

All changes were implemented in DeviceGemm structures to avoid secondory
affect on grouped convolutions.

Got lots of AI generated comments. Addressed the ones that seemed
relevant on the functionality.

## Test Plan

Within CK the following examples can be used with modified input sizes:
example_gemm_multiply_multiply_xdl_fp8
example_gemm_mx_fp4
Tested with Aiter tuning on provided shapes.

## Test Result

All gemms run and provide correct results.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Márton Bidlek <marton.bidlek@streamhpc.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2026-05-27 18:55:15 +00:00
Michael Halkenhäuser
5dc8fbd1a8 [rocm-libraries] ROCm/rocm-libraries#6900 (commit 28608c2)
[CK] Fix and expand CK's commit records in version.h (#6900)

## Motivation

In `version.h` of a CK installation `CK_COMMIT_ID` would be empty for
out-of-source builds.
Additionally, if it worked, it would show the parent repo's
(`rocm-libraries`) commit.

## Technical Details

Dropped "required" constraint so "unknown" string becomes a graceful
option.

Changed process of determining the CK commit, now uses
`WORKING_DIRECTORY`.
Thus, `CK_COMMIT_ID` holds only the last CK-relevant commit.

Added `CK_PARENT_COMMIT_ID` which holds the parent's, e.g.
`rocm-libraries`, commit.
This can be the same as `CK_COMMIT_ID`, or not even applicable,
depending on the scenario.

## Test Plan

Ran CMake configuration and installation of CK to verify happy path.

## Test Result

Commit SHA's showed the expected values depending on the repo state.

## Submission Checklist

- [ x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-27 10:17:02 -07:00
Aviral Goel
4aecc8de5b [rocm-libraries] ROCm/rocm-libraries#7442 (commit b7d57ef)
[CK] CompV4: remove redundant barrier (+5.7% gfx942, +1% gfx950) (#7442)

## Summary

- Remove one redundant `block_sync_lds()` from the pong phase of the
CompV4 GEMM pipeline hot loop
- The pong phase had 2 barriers while ping had 1 — the second pong
barrier (after LDS writes, before global loads) was unnecessary because
the sync at the top of the next ping iteration already ensures LDS
coherence
- Removing this barrier allows global loads to overlap with LDS write
drain, restoring the latency hiding the ping-pong design was built to
provide
- Abstracting away Ping Pong phases into generic lambda avoids making
such mistake again.

## Benchmark

### gfx942 (MI300X), 86 fp16 GEMM shapes

| Metric | Value |
|---|---|
| Improved (>1%) | **80** |
| Neutral (±1%) | **4** |
| Regressed | **2** |
| Average gain | **+5.7%** |
| Best gain | +18.0% (4096x256x16384) |
| Worst regression | -2.9% (12288x3072x4096) |

### gfx950 (MI355X), 86 fp16 GEMM shapes

| Metric | Value |
|---|---|
| Improved (>1%) | **32** |
| Neutral (±1%) | **54** |
| Regressed | **0** |
| Best gain | +9.0% (4096x2048x28672) |

### Top gains by workload

| Shape (MxNxK) | Source | gfx942 BL | gfx942 Opt | gfx942 Gain | gfx950
BL | gfx950 Opt | gfx950 Gain |
|---|---|---|---|---|---|---|---|
| 4096x256x16384 | bloom_fc2 | 38.3 | 45.2 | **+18.0%** | 75.6 | 77.0 |
+1.9% |
| 4096x512x22016 | llama2_7b | 77.8 | 90.8 | **+16.7%** | 152.4 | 154.9
| +1.7% |
| 256x1536x7168 | deepseek | 14.4 | 16.7 | **+16.0%** | 27.2 | 28.0 |
+2.8% |
| 4096x1024x22016 | llama2_7b | 156.2 | 180.8 | **+15.7%** | 304.8 |
311.6 | +2.2% |
| 4096x1024x16384 | bloom_fc2 | 154.6 | 178.5 | **+15.4%** | 303.1 |
309.5 | +2.1% |
| 4096x4096x22016 | llama2_7b | 371.0 | 412.3 | **+11.1%** | 819.8 |
823.6 | +0.5% |
| 4096x2048x28672 | llama3_8b | 235.5 | 259.5 | **+10.2%** | 530.0 |
577.7 | **+9.0%** |
| 250880x256x4096 | bloom_logits | 289.0 | 335.9 | **+16.2%** | 595.5 |
599.1 | +0.6% |
| 8192x8192x8192 | square | 411.8 | 432.9 | **+5.1%** | 825.1 | 825.8 |
+0.1% |
| 7168x4096x8192 | llama70b | 362.9 | 374.7 | **+3.3%** | 775.8 | 782.5
| +0.9% |

## Hardware counter analysis (rocprof-compute, 8192x8192x8192, gfx942)

| Metric | Baseline | Optimized | Delta |
|---|---|---|---|
| s_barrier per ping+pong | 5 | 4 | **-1** |
| MFMA Utilization | 47.8% | 55.5% | **+7.7pp** |
| IPC | 0.17 | 0.21 | **+23.5%** |
| MFMA F16 % of peak | 30.6% | 33.5% | **+2.8pp** |
| VALU (instructions) | 41.67M | 41.67M | identical |
| MFMA (instructions) | 65.91M | 65.91M | identical |
| Spill/Stack Read | 8.27M | 8.27M | identical |

All instruction counts are identical — the optimization removed one
synchronization point, not any compute instructions.

## Correctness

- gfx942: GPU verification (`-v=2`) passed on 4 shapes (8192x8192x8192,
4096x4096x4096, 22016x4096x4096, 4096x512x28672)
- gfx950: GPU verification (`-v=2`) passed on all 86 shapes
2026-05-27 12:23:43 -04:00
Illia Silin
c24e528481 [rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76)
[CK] suppress compiler warnings while building pytorch. (#7760)

## Motivation

Recently added compiler flags that are required to suppress false
warnings by latest staging compiler are not recognized by older compiler
versions and are triggering an avalanche of warnings. Previous attempt
to suppress them by using -Wno-unknown-warning-option flag didn't help,
because that flag wasn't recognized either and just added more warnings.
I've verified that current approach by checking the clang version
actually works as intended and makes the warnings go away.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-27 06:56:58 -07:00
Bartłomiej Kocot
60df79085d [rocm-libraries] ROCm/rocm-libraries#7631 (commit d591a7c)
[CK] Grouped Convolution Global Load/Store support (#7631)

## Motivation

Grouped Convolution Global Load/Store support to cover large tensor
cases.

## Technical Details

Utilize global load for grouped convolution forwad kernels. Update
Indexes to use int64.

## Test Plan

- test utils
- test conv kernels in next pr with instances

## Test Result

CI pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-1255

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-27 08:21:54 +00:00
JH-Leon-KIM-AMD
00e1d82ae7 [rocm-libraries] ROCm/rocm-libraries#7732 (commit b0e29d9)
[CK] Fix grouped conv bwd data stride>1 silent miscompute (ALMIOPEN-1959) (#7732)

## Motivation

Fix silent miscompute in the grouped convolution backward-data kernel
(`DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1`) when stride >
dilation (ALMIOPEN-1959). PR #6208 introduced a flat-descriptor fast
path that dropped all but the first sub-GEMM, producing zeroed slices of
`dx` on
the (G=1, stride>1, 2D, NumDTensor=0) intersection. Restore correctness
without giving up the perf gains PR #6208 delivered on stride=1 shapes.

## Technical Details

- Tighten the flat-descriptor fast-path gate to require
`arg.gemms_count_ == 1` (i.e. a single sub-GEMM per dispatch — its
original purpose). For stride > 1, the implicit GEMM is split into
`gemms_count_` sub-GEMMs whose output cells tile `dx` disjointly;
routing them through the flat path required dropping all but the first,
which was the source of the bug.
- Stride > 1 now falls through to the existing grouped CShuffle path,
which packs all sub-GEMMs into one descriptor array and walks them
on-device in a single kernel launch. This is the pre-PR-6208 production
path; correctness is established and per-dispatch launch count is
minimised.
- Add regression coverage for the (G=1, stride>1, 2D, NumDTensor=0)
intersection in
`test/grouped_convnd_bwd_data/test_grouped_convnd_bwd_data.cpp` with
`gemms_count` ∈ {4, 9, 36}. Pre-existing cases did not hit this
intersection (all stride>1 cases used G=2; all G=1 cases used stride=1),
which is why PR #6208's regression slipped past CI.

## Test Plan

- `ctest -L SMOKE_TEST -R 'grouped_convnd_bwd_data'` on gfx942 (smoke
tier — runs on every PR via `smart_build_and_test.sh`).
- End-to-end verify (`verify=1`) via
`example_grouped_conv_bwd_data_xdl_fp16` on stride 1/2/3/6 shapes
including the original ALMIOPEN-1959 case and a cross-bucket
(`gemms_count=36`) case spanning two `MaxGroupedGemmGroupsNum=32`
buckets.
- ckProfiler A/B sweep on MI300X (gfx942) toggling the flat-path gate
via an environment variable: full kernel-family enumeration, winning
kernel + its avg_time reported under each gate. 33/41 shapes completed
before the sweep was stopped; the remaining 8 were the largest
i2v/synthetic shapes where ckProfiler exceeded its 300s per-shape
enumeration budget (not relevant to the verdict).

## Test Result

### Correctness

| Test | Result |
|---|:---:|
| `test_grouped_convnd_bwd_data` (12 type parameterizations × Test2D,
includes 3 new regression shapes) | **12/12 PASSED** in 14.18 s |
| `test_grouped_convnd_bwd_data_interface` (API checks) | **PASSED** in
0.28 s |
| ALMIOPEN-1959 stride=2 (`verify=1`) | **PASSED** |
| stride=1 K3 (`verify=1`) | **PASSED** |
| stride=3 K3 `gemms_count=9` (`verify=1`) | **PASSED** |
| stride=6 K6 `gemms_count=36` cross-bucket (`verify=1`) | **PASSED** |

### Performance (ckProfiler A/B on gfx942 / MI300X)

Comparing the **post-fix gate** (flat path only when `gemms_count_==1`,
column "B") vs the **inner-loop variant** that keeps the flat path on
stride>1 (column "A") across 25 stride>1 shapes where production picks
a `_v1` instance (so the gate actually fires):

| Stride | Shapes | A wins | Tie | B wins | Notes |
|:------:|:------:|:------:|:---:|:------:|---|
| 1 (sanity, gate moot) | 3 | 0 | 3 | 0 | gate doesn't differentiate — A
== B as expected |
| > 1 (gate fires) | 25 | **0** | 11 | **14** | B wins +6% to +32%; A
never wins |

Highlights from the firing-gate cases:

| Shape (G=1, stride=2 unless noted) | A ms | B ms | B vs A |
|---|---:|---:|---:|
| ALMIOPEN-1959 (N=16, K=256, C=128, 5×5, 40×175) | 0.183 | 0.171 | **B
+6%** |
| Retinanet-L61 (N=32, K=C=256, 3×3, 25×25) | 0.054 | 0.045 | **B +17%**
|
| i2v-010 (N=1, K=C=384, 3×3, 277×209) | 0.174 | 0.125 | **B +28%** |
| Synthetic 50×50 K3 N=32 K=C=256 | 0.131 | 0.088 | **B +32%** |

Why B wins everywhere the gate fires: for `gemms_count = N`, the flat
path needs N kernel launches (one per sub-GEMM), while the grouped path
loops over the same N sub-GEMMs on-device in 1 launch. The (N−1) ×
launch-tax is a structural disadvantage A can't recover from.

### Diff

| File | Lines |
|---|---:|
|
`include/.../device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v1.hpp`
| +14 / −8 (one extra condition + expanded dispatch comment) |
| `test/.../test_grouped_convnd_bwd_data.cpp` | +9 / −0 (3 new shapes) |

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-27 09:59:14 +03:00
assistant-librarian[bot]
6181eb2adf [rocm-libraries] ROCm/rocm-libraries#4279 (commit 5b3f4b7)
[CK_TILE] Stream-K XCD remapping (#4279)

## Proposed changes

This PR adds support for XCD remapping as detailed in this
[document](https://amdcloud.sharepoint.com/:w:/r/sites/ComposableKernels/Shared%20Documents/Stream-K/Design%20Docs/XCD%20Mapping.docx?d=w2df1b0737dc54614970d99a2e26022d1&csf=1&web=1&e=mLVN4A).
On gfx942, workgroups are typically scheduled round-robin across XCDs,
which can lead to poor locality. We will use a remapping to assign
workgroups to contiguous tiles in the XCDs improving the locality and
the cache hit rate. This is done through a function that computes this
contiguous mapping from this
[PR](https://github.com/ROCm/composable_kernel/pull/3161), which we have
added to the StreamKTilePartitioner. This will require minimal changes
to the Stream-K algorithm, only requiring a remap at the time the
workgroups are partitioned. Through this approach we can improve the
data locality by improving cache hits therefore closing performance gaps
that are seen with the default scheduling. There have been unit tests
added to verify the function in isolation. This is an optimization that
is not specialized to just Stream-K GEMM and can be applied across GEMM.

Note: This only applies to the gfx942 as they introduce the XCDs.

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged

---
🔁 Imported from
[ROCm/composable_kernel#3652](https://github.com/ROCm/composable_kernel/pull/3652)
🧑‍💻 Originally authored by @arai713

---------

Co-authored-by: Astha <astha.rai713@gmail.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: arai713 <67439843+arai713@users.noreply.github.com>
2026-05-26 09:43:03 -07:00
Aviral Goel
0df3523ef1 [rocm-libraries] ROCm/rocm-libraries#6807 (commit ddda8ac)
[CK_TILE] Add save_matrix_txt() and extract HostTensor I/O to free functions (#6807)

## Summary
- Extract `loadtxt`, `savetxt`, and `save_matrix_txt` from `HostTensor`
member functions into standalone free functions in `host_tensor_io.hpp`
(Single Responsibility Principle)
- Add `save_matrix_txt()` for writing 2D tensors to space-separated text
files with configurable output limit (default 256x256, pass 0 to dump
all)
- Supports float, int, and int8_t output formats via a `dtype` parameter
- Validate dtype early and throw on unsupported values in all three
functions
- Update callers in `15_fused_moe/main.cpp` to use free function syntax
2026-05-26 11:07:18 -04:00
Anton Gorenko
66d6714376 [rocm-libraries] ROCm/rocm-libraries#5388 (commit 45583bd)
[CK_TILE][FMHA] Improve precision of mxfp4 FMHA with fp6 for matrix P (#5388)

## Motivation

Improve precision of mxfp4 without performance penalties.

## Technical Details

Since performance of scale MFMAs is the same when neither A nor B is
fp8/bf8, it is possible to use fp6 x fp4 instead of fp4 x fp4 for the
second GEMM, while types of Q, K, V stay the same.
This allows to improve overall precision significantly because fp6 has
32 non-negative values used for P quantization compared to just 8 values
for fp4.

It was found that there is a compiler bug with
`__builtin_amdgcn_cvt_scalef32_2xpk16_fp6_f32` (described in
LCOMPILER-561) but a workaround seems to fix all failing instances.

## Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

## Test Result

The tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-26 06:55:17 -07:00
Yung-sheng Tu
760f9e1d0a [rocm-libraries] ROCm/rocm-libraries#7104 (commit 0fab8d8)
[CK TILE] Unification Work – Add MFMA specialisations for `fp64_t` (#7104)

## Motivation

This PR adds two specialisations related to `fp64_t`.

## Technical Details

This adds two new specialisations for MFMA dense builtins, and adjusts
ABLayout and CLayout to L{K1BM} and L{M1BN}.

## Test Plan

All the new wrappers were added to the test suite in
test_amdgcn_mma_layout.inc.

## Test Result

Test should pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-26 10:49:36 +00:00
Yi DING
6a9c03f692 [rocm-libraries] ROCm/rocm-libraries#7450 (commit 402dbad)
[CK_TILE] Use Persistent Scheduling for FMHA BWD Group Deterministic (#7450)

## Motivation

FMHA BWD group-mode deterministic currently uses a non-persistent
scheduler: each `(batch, head, K-row)` work-item is launched as its own
block, with no work-stealing across CUs. On uneven workloads (varlen,
GQA, many heads with
few K-rows) this leaves CUs idle and forces a larger dq_acc workspace
than necessary.

This PR ports the persistent + deterministic scheduling already used in
batch mode to group mode: a fixed-grid kernel that pre-computes per-CU
work ranges on the host and uses sparse dq_acc slot indexing so multiple
K-rows handled
by the same CU share one accumulator slot via intra-CU atomic adds.

Stacked on #7331; merge that first.

## Technical Details

Single file changed: `ops/fmha/kernel/fmha_bwd_kernel.hpp`.

A new `kUsePersistent` path is added to the group-mode deterministic
kernel, mirroring the batch-mode persistent scheduler. The host
pre-computes a fixed per-CU partition of the total `(batch, head,
K-row)` work and packs it into
`cu_states[]` so the GPU consumes it in a single launch. Host
preparation happens in four steps:

1. Build per-batch `seqstart` prefix sums.
2. Fill per-batch `(sq_w, nc)` with a placeholder `nsplits` (bumped in
step 3).
3. Two-pointer scan over CUs to fill `cu_states[c]` (`isplit`,
`head_start`, `c_start`, `w_lo`, `w_hi`), accumulating `nsplits[b]` as
`max(cs->isplit + 1)`.
4. Compute compact per-batch dq_acc offsets from the finalized
`nsplits`.

`isplit` is the sparse dq_acc slot index — one CU's multi-K-row writes
share slot `ceil(wc_start / denom)`, enabling intra-CU atomic
accumulation instead of one slot per K-row.

`denom = max(sq_w, target_w)`, splitting two regimes:

- `target_w >= sq_w` (large work): `denom = target_w`, intra-CU atomic
optimization engaged.
- `target_w < sq_w` (sub-K-row sharding, multiple CUs sharing one
K-row): `denom = sq_w` collapses to per-K-row indexing (`= c_start`),
keeping `isplit ∈ [0, nc-1]` and matching the `nsplits_max =
ceil(s_k/kN0) = nc` upper bound that #7331's
`GetWorkspaceDeviceSizeUpperBound` assumes for group+det.

`isplit` is additionally clamped to `nc-1` to absorb empty CUs
(rounded-up `wc_start` past the last K-row); they don't write dq_acc on
GPU so the slot value is harmless.

`nsplits[b]` is accumulated dynamically in step 3 rather than via a
closed form so it tightly matches the actual sparse slots used; step 4
(offsets) follows step 3 since offsets now depend on the dynamic
`nsplits`.

Group mode also allows batches with `seqlen_q == 0`. The persistent
scheduler skips them on the dQ path (no work) but dK/dV are still
zero-filled.

## Test Plan

Built `tile_example_fmha_bwd` with receipt 5 (fp16, no-bias, no-dropout,
`dpad == dvpad`, group + batch) on gfx950 (MI355X).

- 8-case smoke (shapes that exercise the sub-K-row regime).
- 44-case sweep covering: mask 0/1/2, GQA, var seqlen, `d != d_v`,
extreme
  small seqlen / `nc=1`, CU >> work, huge batch, batch-mode regression.
- 12-case perf comparison vs the non-persistent baseline (warmup=10,
  repeat=50).

## Test Result

- All 8 + 44 cases `valid:y`.
- Perf: ±5% noise, average -0.4% across the 12 cases — neutral.
- Batch-mode deterministic / non-deterministic regression unchanged.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-26 10:01:54 +08:00
Michal Kulikowski
8de4cb72fb [rocm-libraries] direct push (commit 49b73ad)
[CK][CK_TILE] POC for Instruction Cache prefetch.

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2026-05-25 11:26:26 +02:00
chris-tsiaousis-hpc
c7fac341de [rocm-libraries] ROCm/rocm-libraries#4871 (commit 7d4c040)
[CK] Decouple EpilogueArgs from GridwiseGemm implementation (#4871)

This is duplicate of #4537. I could not re-open it since te target
branch got deleted and could not change the target branch since it was
closed... :)

## Motivation

Right now, all the Epilogues structs are declared inside the base
gridwise struct. They should be independent of it and the specialization
of the selected Epilogue Type should be declared within the the kernel
function.

## Technical Details

All Epilogue structs depend on template parameters that are known to the
base Gridwise Gemm struct. In this PR, we export them to be used
independently by any struct that might need to extract them. This
approach will serve the decoupling purposes for the Epilogues, but also
enable future constructs to use and expand this approach. See
30e2a4c01b64bdea68857c7badd9d7cffbf1adb9.

Right now an issue that arises is that when implementing a new Epilogue
Type, the developer is not forced to decide where this struct should/can
be used or not. To fix this I propose defining an `enum struct
EpilogueType` that will be used to fetch the Epilogue specialization
through a helper struct. See a943ac8d130e12d6843715b322181186e54ba15c.
Note that all the instantiation details will stay in this helper struct.
Also note the static assertion in the else statement.

## Test Plan

Test with existing CI, as nothing is added/removed.

## Test Result

All relevant existing CI tests should pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-05-22 18:39:01 +00:00
JP-Fernando
74bc86240b [rocm-libraries] ROCm/rocm-libraries#5647 (commit 490437a)
[CK Tile] Add gemm universal preshuffle to MX GEMM  (#5647)

## Motivation

Add gemm universal preshuffle support to existing MX GEMM pipeline.

The straightforward way to do this is to port the `mx_flatmm` pipeline
to the existing `gemm_mx` framework.

## Technical Details

The `mx_flatmm` pipeline was not deleted, to allow for
back-compatibility.

## Test Plan

Add `preshuffle` option to example: `tile_example_mx_gemm`.

Add new configurations with enabled preshuffle to the existing
`test/ck_tile/gemm_mx` tests.

## Test Result

Example and tests were successful on `gf950` architecture in the `Alola`
cluster.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
2026-05-22 16:07:53 +02:00
Bartłomiej Kocot
ebb97044f4 [rocm-libraries] ROCm/rocm-libraries#7664 (commit de5d6b1)
Revert "[CK] Enable grouped conv bwd data to match non-grouped perf" (#7664)

## Motivation

Incorrect results has been introduced for some conv bwd cases.

## Technical Details

This reverts commit 33424f65346d6330d0fd94b5a4e6f843f24e52c3.

## Test Plan

CI

## Test Result

Pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

ALMIOPEN-1959
2026-05-22 12:28:49 +00:00
Wojciech Laskowski
3ea9ce7e37 [rocm-libraries] ROCm/rocm-libraries#6567 (commit 753c7a8)
[CK Tile] Adding WMMA wrappers for sparse builtins (#6567)

## Motivation

This PR is part of the [WMMA/MFMA] unification work. It's the third of
the series of PRs (after
https://github.com/ROCm/rocm-libraries/pull/5801 and
https://github.com/ROCm/rocm-libraries/pull/6014) that add all the
necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse
WMMA intrinsics.

## Technical Details

This change adds new specializations for WMMA sparse builtins. In total,
we add 8 WMMA builtins.

## Test Plan

All the new wrappers were added to the test suite in
`test_amdgcn_mma_layout.inc`.

## Test Result

Test pass locally, waiting for the CI.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-22 13:34:33 +02:00
Illia Silin
e02c566795 [rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24)
[CK] upgrade CI to rocm7.13 as default compiler (#7612)

## Motivation

Upgrade the default docker and compiler version in CI to rocm7.13.
In order to pass all the checks I had to also clean up a lot of
non-ascii characters in the source code comments and modify a couple of
tests that were affected by a new compiler logic.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
2026-05-22 02:43:50 +00:00
kensclin
fc2862d712 [rocm-libraries] ROCm/rocm-libraries#6846 (commit 377def4)
[CK_TILE] Add fmha forward hdim 256 support (#6846)

## Motivation

Enable Composable Kernel FMHA forward kernel for **hdim=256 BF16** on
AMD gfx950 (MI350X). Prior to this change the (256, 256) head-dim
configuration either failed to compile, was filtered out by the
compatibility rules, or produced incorrect kernel output due to an LDS
layout accounting bug.

## Technical Details

  Four files changed, all to enable hdim=256 BF16 on gfx950.

- **`fmha_fwd.py`** — Allow `(256, 256)` in gfx950 compatibility rule;
set `(256,256)` BF16 tile to `M0=128, N0=64` (the LDS-feasible shape on
gfx950); emit minimal valid instance set for d=256 to bound compile
time.

- **`fmha_fwd_kernel.hpp`** — Gate Prefill launch path off for d=256
(`PrefillCase = kM0 > 64 && kQKHeaddim < 256`); the double-buffer
Prefill variant overflows the 160 KB LDS budget.

- **`trload_policy.hpp`** — **Critical correctness fix**: the LDS layout
accounting in `GetSmemSize` was wrong (`max(Q, K+S+V)` instead of
`max(Q, K) + V + S`), under-allocating LDS and silently corrupting d=256
output (~2% wrong values).

- **`trload.hpp`** — Thread `LoadOnce=true` through all d=256 K-LDS
descriptors so the compiler picks the matching XOR swizzle period;
recompute the S-tile LDS offset to match the corrected `GetSmemSize`
formula.

## Test Plan

Built and ran `tile_example_fmha_fwd` on gfx950 (MI350X) with the
canonical d=256 BF16 configurations:

  ```bash
  cd build && ninja tile_example_fmha_fwd
./bin/tile_example_fmha_fwd -prec=bf16 -d=256 -d_v=256 -b=1 -h=32 -h_k=2
-s=1024 -s_k=1024 -bias=n -mask=t -lse=0 -p_drop=0 -warmup=3 -repeat=10
-kname=1 -v=1
./bin/tile_example_fmha_fwd -prec=bf16 -d=256 -d_v=256 -b=8 -h=32 -h_k=2
-s=16384 -s_k=16384 -bias=n -mask=t -lse=0 -p_drop=0 -warmup=3
-repeat=10 -kname=1 -v=1
  ```

## Test Result

  ```bash
-b=1 -s=1024
[bf16|batch|bhsd] b:1, h:32/2, s:1024/1024, d:256/256, scale_s:0.0625,
bias:n, p_drop:0, lse:0, qscale:n, mask:t(-1:0), v:r,
fmha_fwd_d256_bf16_batch_b128x64x32x256x32x256_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_trload_vr_psddv_nlogits_nbias_mc_nlse_ndropout_nskip_nqscale_ntrload_nsink,
0.058 ms, 298.42 TFlops, 618.68 GB/s, valid:y

-b=4 -s=16384
[bf16|batch|bhsd] b:8, h:32/2, s:16384/16384, d:256/256, scale_s:0.0625,
bias:n, p_drop:0, lse:0, qscale:n, mask:t(-1:0), v:r,
fmha_fwd_d256_bf16_batch_b128x64x32x256x32x256_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_trload_vr_psddv_nlogits_nbias_mc_nlse_ndropout_nskip_nqscale_ntrload_nsink,
42.797 ms, 822.18 TFlops, 106.63 GB/s, valid:y
  ```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: poyenc <1132573+poyenc@users.noreply.github.com>
2026-05-22 01:57:41 +08:00
JiaLuo-CAN
5ff7497fa7 [rocm-libraries] ROCm/rocm-libraries#7537 (commit 07123f4)
[CK Tile] Fix Grouped Gemm quant mixed precision (#7537)

<Migrate from Internal repo PR>
test_ck_tile_grouped_gemm_quant_tensor would fail for mixed FP8/BF8
cases:
std::tuple<Row, Col, Row, FP8, F32, BF8, F32, F32, F16, TensorQuant,
False, True, False>,
std::tuple<Row, Col, Row, BF8, F32, FP8, F32, F32, F16, TensorQuant,
False, True, False>

GFX1250 would fail with incorrect results, GFX950 would fail when
compiling BF8+FP8 and give incorrect results for FP8+BF8.
The issue is due to the wrong ComputeDataType selection.
The fix is to consider original ADataType and BDataType even when
ComputeDataType is not void. For compiling error on gfx950, the bf8,
fp8, 16x16x32 warp Gemm is added.
2026-05-21 08:36:23 -07:00
JP-Fernando
e7798e9560 [rocm-libraries] ROCm/rocm-libraries#7112 (commit a6e5eac)
Add asynchronous XOR shuffle support to the Async GEMM pipeline and the MX GEMM pipeline (#7112)

## Motivation

The goal of this work is to apply XOR shuffle (swizzle) to the current
`comp_async` GEMM pipeline and the `gemm_mx` pipeline.
XOR swizzling has been helpful to avoid LDS bank conflicts, as data are
redistributed across LDS banks, such that simultaneous threads accessing
different rows land on different LDS banks.

## Technical Details

A similar approach to the work in the existing eight-waves pipeline was
followed.
Currently, XOR swizzle support is available for FP8 and BF8 types.
FP4 support is also available for MX GEMM.
Should the types not match, or should the async vector width be of an
unsupported size, then the pipeline falls through to the previously
existing ('unswizzled') path.

## Test Plan

Execute `test_ck_tile_gemm_pipeline_comp_async` for the Async GEMM
pipeline.
Execute `test_ck_tile_mx_gemm_fp8` and `test_ck_tile_mx_gemm_fp4` for
the MX GEMM pipeline.

## Test Result

The tests passed successfully in the `Alola` cluster with MI350
hardware.

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-05-21 09:36:41 +02:00
Wojciech Laskowski
275629fe34 [rocm-libraries] ROCm/rocm-libraries#6014 (commit 2f8259d)
[CK Tile] Adding MFMA wrappers for dense builtins (#6014)

## Motivation

This PR is part of the [WMMA/MFMA] unification work. It's the second of
the series of PRs (after #5801) that add all the necessary MMA builtins
as `amdgcn_mma` structs. This PR focuses on dense MFMA intrinsics.

## Technical Details

This change adds new specializations for WMMA dense builtins. In total,
we add 55 MFMA builtins.

## Test Plan

All the new wrappers were added to the test suite in
`test_amdgcn_mma_layout.inc`.

## Test Result

Test pass locally, waiting for the CI.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-21 09:05:19 +02:00
JH-Leon-KIM-AMD
720ceb6500 [rocm-libraries] ROCm/rocm-libraries#7528 (commit b4cae6f)
[CK Tile] Support multi-vector reads in static encoding patterns  (#7528)

## Motivation

The thread-raked / warp-raked / block-raked static tile distribution
patterns in `ck_tile` silently produce wrong results when the contiguous
tile dimension is larger than `warp_size * vector_size`, because the
encoding has no per-thread iteration dimension along X.

Concretely, with `M_Tile=N_Tile=128`, `VectorSize{A,B,C}=1` in
`ConvConfigComputeV3`, the grouped convolution backward-weight example
reports about 50 percent wrong values, with errors starting exactly at
the `X0*X1 = 64` boundary. The second pass over the contiguous dim is
never performed.

This PR extends the encoding so multi-vector reads in the contiguous
tile dimension are supported, while keeping every existing call site
bit-for-bit identical.

## Technical Details

Three files changed.

### 1. `include/ck_tile/core/algorithm/static_encoding_pattern.hpp`

Add a per-thread X iteration dimension in all three raked
specializations:

- `X0 = min(warp_size, XPerTile / X1)` — threads in X dim
- `X1 = min(LargestVec, VecSize)` — vector size per access
- `X2 = XPerTile / (X0 * X1)` — number of X-iters per thread (new)

`X2` is gated with `if constexpr (X2 == 1) { old } else { new }` in both
`make_2d_static_tile_distribution()` and
`make_shuffled_2d_static_tile_distribution()`.

The new encoding places `X2` in the middle of the Ys iteration list,
which preserves reverse symmetry between the regular `<..., X2, X1>` and
shuffled `<X1, X2, ...>` encodings.

Patterns updated: `thread_raked`, `warp_raked`, `block_raked`.

### 2. `include/ck_tile/core/tensor/transpose_tile.hpp`

Added a parallel `else if constexpr (... && NDimY == 3 && ...)` branch
alongside the existing `NDimY == 2` branch. The original branch is
byte-for-byte unchanged.

Both branches dispatch to the same `transpose_tile2d_impl_in_thread`,
whose body has always been NDimY-generic (iterates with `static_for<0,
NDimY, 1>` and `number<NDimY>{}`).

### 3.
`experimental/grouped_convolution_tile_instances/generate_instances.py`

Removed the two now-obsolete skip guards in `parse_bwd_weight_instances`
and `parse_bwd_data_instances`:

```python
if m_per_block > (warp_size * a_scalar_per_vector) or n_per_block > (warp_size * b_scalar_per_vector):
    print(f"Skipping instance {instance_id} with multiple warps per continous tile dim since it's not supported yet.")
    continue
```

Other unrelated skips (V5 / V6 / ASYNC_V4 pipeline gating,
irregular-load shapes, scalar-per-vector > tile size) are kept
untouched.

### Compatibility

Strict. Every existing caller has `X2 == 1` and therefore hits the
original encoding path verbatim. No upstream config or pipeline behavior
changes.

## Test Plan

The grouped convolution example is the natural exerciser since
`GroupedConvUniversalPipelineAgBgCrPolicy` selects `thread_raked` for
both A and B tiles, and all three conv directions share the same
`ConvConfigComputeV3`.

For each test below we ran:

```
./build/bin/tile_example_grouped_conv_bwd_weight [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_fwd        [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_bwd_data   [-prec={fp16,bf16}]
```

with `ConvConfigComputeV3` tile/vector parameters tweaked to cover both
code paths:

| Test | M / N / K | VecA/B/C | A path | B path | dtype |

|------|-------------|----------|------------|----------------|-------------|
| T1 | 16/64/32 | 4/8/4 | old (X2=1) | old (X2=1) | fp16 |
| T2 | 128/128/64 | 2/2/2 | old (X2=1) | old (X2=1) | fp16 |
| T3 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 |
| T5 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 (3 dir)|
| T4b | 128/128/128 | 1/1/1 | new (X2=2) | new (X2=2) | fp16 + bf16 (3
dir) |

A larger T4a (256/256/128) was attempted to stress both A and B with
X2>1 on bigger tiles but was blocked by the gfx942 hardware LDS cap (128
KB > 64 KB limit), independent of this PR.

For the generator change we ran:

```
python3 generate_instances.py --mode profiler --direction all
```

and verified `Skipping instance ... with multiple warps per continous
tile dim` no longer appears (count went from non-zero to 0); other skip
categories are unchanged.

`clang-format-18` was applied to both modified `.hpp` files (matches the
repo's `.clang-format`).

## Test Result

- T1 and T2 (compat-strict, every X2 is 1, old code path): `correct`.
Confirms existing callers are unaffected.
- T3 (X2=4 on B only): `correct`. First true exercise of the new NDimY=3
encoding + transpose branch.
- T5 (T3 across `fwd` + `bwd_data` + `bwd_weight`, fp16): all 3
`correct`.
- T4b (X2>1 on both A and B, fp16 + bf16, all 3 directions): all 6 runs
`correct`.
- Generator: 0 `multiple warps per continous tile dim` skips remaining;
other skips unchanged.

Sample run output (T4b, bf16, bwd_data):

```
shape: tile_gemm_shape_128x128x128x4_1x4x1_16x16x32
pipeline: pipeline_AgBgCrCompV3_128x128x128_256_1x1x1_1x4_1x1x1_..._DoubleSmemBuffer_0
Vector size A: 1, Vector size B: 1, Vector size C: 1
0.934907 ms, 8.34683 TFlops, 34.3178 GB/s
Relative error threshold: 0.00390625 Absolute error threshold: 0.25
The CPU verification result is: correct
```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-20 17:25:22 +03:00
Kiefer van Teutem
b5f8bef97f [rocm-libraries] ROCm/rocm-libraries#6088 (commit 6ac353c)
[CK Tile][MFMA/WMMA unification] Add support for packed datatypes (tiny types) (#6088)

## Motivation
This MR makes all the changes required for the unified architecture to
be able to deal with packed datatypes i.e. int4, fp4, fp6, and bf6. The
crux is that layout parameters should be interpreted as describing the
pure mathematical matrix fragments, while the ext_vectors and tile
distribution encodings describe everything in terms of packed datatype
units. This matches how packed types are dealt with in ck_tile and
should play nicely with the load and store tile ops once we integrate
the unified framework into CK tile.

The bf6 datatype was added to CK tile in the form of pk_bf6x16_t and
pk_bf6x32_t, which did not exist before.

The ext_vector implementations of pk_fp6x16_t and pk_bf6x16_t (vec size
1 and 2) were extended to make the subscripting operator work as
expected.

The layout test was adapted to be compatible with all packed datatypes,
and all new intrinsics were added to the test.

This MR adds ALL intrinsics across ALL architectures which use packed
datatypes, as well as ALL scale intrinsics:

mfma_scale_f32_16x16x128_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6,
BF6xBF6)
mfma_scale_f32_32x32x64_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6,
BF6xBF6)
wmma_i32_16x16x16_iu4_w32
wmma_i32_16x16x16_iu4_w32_gfx12
wmma_i32_16x16x32_iu4_w32_gfx12

## Testing
All intrinsics were tested on all architectures.
2026-05-20 12:36:13 +00:00
Aviral Goel
458dd0ac4c [rocm-libraries] ROCm/rocm-libraries#7130 (commit 9e1e065)
[CK_TILE] Redesign LDS store API with pre-computed window coordinates (+15% MI355X, +6% MI300X) (#7130)

## Summary

- Redesign the LDS store API to separate window creation from memory
transfer
- Add `MakeDistributedLdsStoreWindow` factory, `LocalStore` (fast path),
and `LocalStoreWithCoordRecompute` (slow path) to the pipeline base
class
- Convert CompV3 as the reference implementation
- Document the slow/fast path distinction across core tensor headers

## Motivation

`LocalPrefill` hides a performance cliff: when given a bare
`tile_window_with_static_lengths`, it silently reconstructs
`tile_window_with_static_distribution` on every call — paying
significant VALU overhead (~96 for typical configurations) for XOR
coordinate computation. The cost is invisible at the call site.

The new API makes the cost explicit via three verbs:

| Verb | Method | Cost | When to use |
|------|--------|------|-------------|
| **Create** | `MakeDistributedLdsStoreWindow(bare, dstr)` | VALU (once)
| Before hot loop, when VGPR budget allows |
| **Store (fast)** | `LocalStore(precomputed_window, tensor)` | 0 VALU
for coords | Pre-computed window available |
| **Store (on-the-fly)** | `LocalStoreWithCoordRecompute(bare, tensor)`
| VALU per call | VGPR budget tight, or one-shot stores |

Both `LocalStore` and `LocalStoreWithCoordRecompute` enforce correct
window types via `static_assert`. `LocalPrefill` is retained for
backward compatibility (69 call sites across 6 pipeline files).

## Performance

### 86 Shapes, CompV3_2 (128×128 tile), fp16, RCR layout

**gfx942 (MI300X): 86/86 improved, 0 regressions. Average gain: +6.2%**
**gfx950 (MI355X): 85/86 improved, 1 neutral, 0 regressions. Average
gain: ~+15%**

<img width="2777" height="1178" alt="pr7130_perf_chart"
src="https://github.com/user-attachments/assets/b2f5c406-eb20-469d-8da6-dd608c28fbcc"
/>

| Shape (MxNxK) | Source | gfx942 | gfx950 |
|---|---|---|---|
| 22016x256x4096 | llama2_7b_fc1 | +5.3% | +11.4% |
| 22016x512x4096 | llama2_7b_pfill | +5.9% | +10.9% |
| 4096x512x22016 | llama2_7b_pfill | +7.6% | +28.5% |
| 22016x1024x4096 | llama2_7b_pfill | +6.1% | +10.1% |
| 4096x1024x22016 | llama2_7b_pfill | +7.4% | +17.2% |
| 22016x4096x4096 | llama2_7b_pfill | +5.2% | +9.3% |
| 4096x4096x22016 | llama2_7b_pfill | +6.0% | +9.3% |
| 4096x4096x4096 | llama2_7b_pfill | +5.7% | +10.6% |
| 28672x256x4096 | llama3_8b_fc1 | +5.4% | +12.2% |
| 28672x512x4096 | llama3_8b_pfill | +4.9% | +6.4% |
| 4096x512x28672 | llama3_8b_pfill | +7.4% | +1.5% |
| 28672x2048x4096 | llama3_8b_pfill | +4.9% | +8.6% |
| 4096x2048x28672 | llama3_8b_pfill | +6.4% | +8.4% |
| 28672x8192x4096 | llama3_8b_pfill | +5.4% | +8.0% |
| 7168x1024x8192 | llama70b_pfill | +6.6% | +10.8% |
| 8192x1024x7168 | llama70b_pfill | +6.4% | +11.4% |
| 7168x4096x8192 | llama70b_pfill | +6.2% | +9.6% |
| 16384x256x4096 | bloom_fc1 | +6.4% | +20.3% |
| 16384x512x4096 | bloom_fc1 | +5.8% | +8.5% |
| 16384x1024x4096 | bloom_fc1 | +6.0% | +10.9% |
| 16384x2048x4096 | bloom_fc1 | +5.3% | +10.1% |
| 16384x3072x4096 | bloom_fc1 | +5.5% | +8.8% |
| 16384x4096x4096 | bloom_fc1 | +5.7% | +8.8% |
| 4096x256x16384 | bloom_fc2 | +7.8% | +33.6% |
| 4096x512x16384 | bloom_fc2 | +7.5% | +31.6% |
| 4096x1024x16384 | bloom_fc2 | +7.1% | +17.1% |
| 4096x2048x16384 | bloom_fc2 | +6.9% | +11.0% |
| 4096x3072x16384 | bloom_fc2 | +6.8% | +11.0% |
| 4096x4096x16384 | bloom_fc2 | +6.7% | +10.3% |
| 12288x256x4096 | bloom_inproj | +6.7% | +22.0% |
| 12288x512x4096 | bloom_inproj | +6.2% | +9.8% |
| 12288x1024x4096 | bloom_inproj | +5.9% | +12.4% |
| 12288x2048x4096 | bloom_inproj | +5.8% | +10.1% |
| 12288x3072x4096 | bloom_inproj | +5.4% | +10.1% |
| 12288x4096x4096 | bloom_inproj | +5.7% | +9.1% |
| 250880x256x4096 | bloom_logits | +2.6% | +0.5% |
| 4096x256x4096 | bloom_outproj | +7.1% | +28.4% |
| 4096x512x4096 | bloom_outproj | +6.8% | +27.4% |
| 4096x1024x4096 | bloom_outproj | +6.5% | +21.3% |
| 4096x2048x4096 | bloom_outproj | +5.9% | +13.1% |
| 4096x3072x4096 | bloom_outproj | +5.9% | +12.0% |
| 16x1536x7168 | deepseek | +7.7% | +34.7% |
| 32x1536x7168 | deepseek | +7.7% | +34.9% |
| 64x1536x7168 | deepseek | +7.6% | +31.3% |
| 128x1536x7168 | deepseek | +7.6% | +25.8% |
| 256x1536x7168 | deepseek | +7.7% | +27.9% |
| 512x1536x7168 | deepseek | +7.6% | +29.1% |
| 1024x1536x7168 | deepseek | +7.3% | +28.8% |
| 2048x1536x7168 | deepseek | +6.9% | +20.5% |
| 4096x1536x7168 | deepseek | +6.3% | +11.0% |
| 8192x1536x7168 | deepseek | +6.2% | +11.3% |
| 16384x1536x7168 | deepseek | +6.0% | +9.1% |
| 20480x1536x7168 | deepseek | +4.8% | +9.3% |
| 16x3072x1536 | deepseek | +6.3% | +25.1% |
| 32x3072x1536 | deepseek | +6.4% | +25.3% |
| 64x3072x1536 | deepseek | +6.4% | +24.8% |
| 1024x1024x1024 | square | +5.5% | +18.7% |
| 2048x2048x2048 | square | +6.0% | +19.2% |
| 3584x3584x3584 | square | +5.3% | +11.2% |
| 5120x5120x5120 | square | +6.1% | +10.0% |
| 6144x6144x6144 | square | +5.5% | +9.8% |
| 8192x8192x8192 | square | +6.0% | +8.2% |
| 1024x4608x1024 | midsize | +4.6% | +4.6% |
| 512x18432x512 | midsize | +1.9% | +10.1% |
| 4096x18432x4096 | midsize | +5.8% | +8.8% |
| 320x8192x320 | stablediff | +4.0% | +11.3% |
| 640x2048x640 | stablediff | +4.5% | +14.0% |
| 320x8192x1280 | stablediff | +5.6% | +20.1% |
| 1x1280x8192 | skinny_m1 | +7.7% | +35.3% |
| 1x8192x1024 | skinny_m1 | +6.0% | +20.3% |
| 1x7168x8192 | skinny_m1 | +7.7% | +36.6% |
| 1x8192x3584 | skinny_m1 | +7.3% | +27.9% |
| 1x13312x6656 | skinny_m1 | +7.6% | +30.3% |
| 1x13312x16384 | skinny_m1 | +7.8% | +4.2% |
| 1x16384x6656 | skinny_m1 | +7.5% | +28.7% |
| 1x16384x16384 | skinny_m1 | +7.7% | +2.3% |
| 16x4096x4096 | skinny_m16 | +7.4% | +31.9% |
| 16x22016x4096 | skinny_m16 | +7.5% | +26.5% |
| 16x28672x4096 | skinny_m16 | +7.0% | +15.1% |
| 16384x1280x8192 | skinny_m16 | +5.6% | +8.7% |
| 16384x8192x1024 | skinny_m16 | +4.5% | +8.8% |
| 2048x4096x2048 | mixed | +4.7% | +9.0% |
| 4096x2048x8192 | mixed | +6.8% | +11.0% |
| 8192x4096x4096 | mixed | +5.2% | +10.0% |
| 1x4096x4096 | mixed | +7.4% | +32.4% |
| 1024x1024x4096 | mixed | +7.1% | +27.4% |

### ISA Hot Loop Diff (LBB1_32, per K-iteration, gfx942)

| Metric | Baseline | Optimized | Delta |
|--------|----------|-----------|-------|
| Total VALU | 621 | 500 | **-121** |
| VGPR / SGPR | 512 / 96 | 512 / 96 | unchanged |

### Hardware Counters — Instruction Mix (gfx950, rocprofiler-compute)

Profiled on MI350X, shape 4096×256×16384 (bloom_fc2). Instruction counts
are deterministic hardware counters.

| Metric | Baseline | Optimized | Δ |
|--------|----------|-----------|---|
| **VALU instructions/kernel** | 4,642,473 | 987,958 | **−78.7%** |
| **INT32 VALU** | 2,592,786 | 541,129 | **−79.1%** |
| Instructions / wavefront | 39,178 | 24,400 | −37.7% |
| VGPRs (avg) | 98 | 90 | −8% |
| **MFMA instructions** | 2,059,702 | 2,059,702 | **0%** |
| **LDS instructions** | 1,564,891 | 1,564,891 | **0%** |
| **VMEM instructions** | 520,996 | 520,996 | **0%** |

MFMA as fraction of total instructions: **30.7% → 67.5%**. Eliminating
~3.65M redundant INT32 VALU instructions (XOR coordinate recomputation
per K-iteration) leaves the scheduler more headroom for MFMA dispatch,
directly explaining the benchmark gains.
2026-05-19 20:22:37 -04:00
Enrico Degregori
9565ca21ec [rocm-libraries] ROCm/rocm-libraries#5552 (commit 369c7a2)
[CK Tile] Eight Waves pipeline for MX GEMM (#5552)

## Motivation

Integrate Eight Waves pipeline in MX GEMM

## Technical Details

 - EightWaves pipeline:
- Add pipeline, policy and block gemm (internally using existing
implementation used by GEMM and ABQuant)
   - Extend support of EightWaves policy for FP4 (packed types)
 - Async pipeline:
- Fix pipeline with packed scales (requires MRepeat and NRepeat to be
contiguous)
- block gemm specific for MX GEMM is defined because distribution
encodings have changed
 - CShuffle:
- Add new functionality to support MRepeat and NRepeat contiguous
(defined by `TilesPacked`)
 - Examples:
- Refactor examples to easily switch different configurations (similar
to GEMM universal)
- Scales values generated consistently with other microscale
implementations in CK Tile
   - Add configuration for EightWaves pipeline
 - Tests:
   - Unify existing FP8 and FP4 tests
   - Add tests for EightWaves pipeline
- Scales values generated consistently with other microscale
implementations in CK Tile

Note: FP6 support for MX GEMM was added later and the support for the
Eight Waves pipeline will be done in following PR

## Test Plan

Add new pipeline to tests: `test_ck_tile_mx_gemm_async` for both FP4 and
FP8

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-19 11:53:19 -07:00
Aaryaman Vasishta
fad83d9c90 [rocm-libraries] ROCm/rocm-libraries#7016 (commit 2b73c00)
[CK] Fix RDNA3 FMHA tile-load paths (#7016)

## Summary

Fix CK tile FMHA paths needed for RDNA3/RDNA4 targets.

## Details

This PR addresses RDNA-specific issues hit while enabling xFormers CK
FMHA on gfx11/gfx12:

- On RDNA3, update FMHA P tile handling so the layout consumed by the
second GEMM matches the WMMA path.

## Testing

Validated downstream with xFormers CK/FMHA on gfx1201/gfx1151.

```text
pytest --import-mode=importlib -q \
  tests/test_mem_eff_attention.py::test_forward \
  tests/test_mem_eff_attention.py::test_backward \
  tests/test_mem_eff_attention.py::test_dropout_ck

3844 passed, 5244 skipped, 26 warnings

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-05-19 06:41:36 -07:00
Yung-sheng Tu
5169cd14a1 [rocm-libraries] ROCm/rocm-libraries#7543 (commit 2b735ff)
Fix for #6207 (#7543)

## Motivation

PR #6207 introduces an error. This PR is the fix of it.

## Technical Details
Adds a path for GFX1250 in `to_string`

## Test Plan

Test has already included.

## Test Result

Test should pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-19 00:54:46 +00:00
Johannes Graner
3727d5220a [rocm-libraries] ROCm/rocm-libraries#5652 (commit 7dc7d1d)
[CK Conv] Wavelet gemm pipeline for bwd_weight convolution (#5652)

## Motivation

In the current CShuffleV3 backward weight kernel, the in-kernel
conv-to-GEMM transform generates significant INT32 VALU pressure per
MFMA instruction. On VALU-heavy shapes (e.g., G=1, 3×3, C=256), these
index computation ops compete with MFMA for VALU issue slots, creating a
bottleneck that cannot be resolved by pipeline prefetching alone.

This PR adds a wave-specialized ("wavelet") convolution backward weight
kernel that splits workgroup threads into two roles:
- **Load waves**: conv-to-GEMM address computation + global memory loads
+ LDS writes (all VALU/VMEM)
- **Math waves**: LDS reads + MFMA + CShuffle epilogue (no index
computation)

By physically separating the two instruction classes onto different
waves, VALU and MFMA execute on different hardware functional units
without contention.

## Technical Details

**Core kernel (new files):**
- `gridwise_gemm_xdl_waveletmodel_cshuffle_conv_v3.hpp` —
wave-specialized gridwise GEMM for conv bwd weight (2-way split: load +
math)
- `device_grouped_conv_bwd_weight_xdl_waveletmodel_cshuffle_v3.hpp` —
device op following CShuffleV3 patterns; `BlockSize =
TileMathThreadGroupSize` for MFMA wave assignment, `LaunchBlockSize =
TileLoad + TileMath` for kernel launch

**Wave pipeline (modified):**
- `gridwise_gemm_waveletmodel.hpp` — load/math wave pipeline structs
with `sched_group_barrier` scheduling hints to front-load VMEM reads
before address-advance VALU

**Two wave ratios:**
- **(4,4)**: 256 load + 256 math = 512 threads (8 waves). Best on large
shapes.
- **(4,2)**: 256 load + 128 math = 384 threads (6 waves). Best on small
shapes (fewer sync barriers, denser MFMA per math wave).

**Instance coverage (F16 and BF16 symmetric):**

| Ratio | Tiles | Layouts | ConvSpecs |
|-------|-------|---------|-----------|
| (4,4) | M128×N128, M64×N64, M128×N64, M64×N128 | 2D NHWGC, 3D NDHWGC |
Default, Filter1x1Stride1Pad0 |
| (4,2) | M64×N64, M128×N64, M64×N128 | 2D NHWGC | Default,
Filter1x1Stride1Pad0 |

**Existing wavelet model fixes:**
- `BlockSize` corrected from `math::max(TileLoad, TileMath)` to
`TileMathThreadGroupSize` in the flat-GEMM wavelet device op and
gridwise kernel

## Test Plan

- `test_grouped_convnd_bwd_weight` GTest: 34 hardcoded test cases
covering 1D/2D/3D, F16/BF16, G=1/2/16, various spatial sizes
- Performance benchmark: all 37 RetinaNet bwd_weight shapes on gfx950

```bash
ninja -C build test_grouped_convnd_bwd_weight
./build/bin/test_grouped_convnd_bwd_weight
```

## Test Result

**Correctness:** 34/34 GTest cases passed (F16/BF16 × 1D/2D/3D ×
Default/Filter1x1Stride1Pad0 × various G/N/K/C combinations).

**Performance:** Wavelet is the fastest overall instance on 12/37
RetinaNet shapes — all G=1, 3×3 convolutions with C=256 (the VALU-heavy
target shapes):

| Shape | Uplift vs best baseline |
|-------|------------------------|
| K=36, 7×7 | 1.91x |
| K=36, 100×100 | 1.60x |
| K=36, 13×13 | 1.43x |
| K=36, 25×25 | 1.38x |
| K=36, 50×50 | 1.38x |
| K=256, 100×100 | 1.24x |
| K=256, 13×13, s=2 | 1.20x |
| K=256, 25×25, s=2 | 1.20x |
| K=256, 7×7 | 1.17x |
| K=256, 13×13 | 1.13x |
| K=2376, 50×50 | 1.05x |
| K=2376, 100×100 | 1.06x |

Where wavelet does not win (25/37): 1×1 convolutions (explicit kernel
does host-side transform), grouped convolutions with small per-group
channels, and shapes where standard CShuffleV3 already amortizes VALU
overhead.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: jakpiase <jakpia21@gmail.com>
2026-05-18 17:46:01 +02:00
JH-Leon-KIM-AMD
9a5d1ea791 [rocm-libraries] ROCm/rocm-libraries#6208 (commit 33424f6)
[CK] Enable grouped conv bwd data to match non-grouped perf via NoShuffle + packed descriptors (#6208)

## Motivation

Improve performance of grouped convolution backward-data kernels to
match non-grouped kernel performance for G=1 cases.

## Technical Details

- Add NoShuffle epilogue path (direct VGPR→Global writes) by setting
`CDEBlockTransferScalarPerVector_NPerBlock = 1`
- Add nongrouped-match instances with optimized BBlockTransfer
parameters for better thread utilization
- Add packed (flat) descriptor path for G=1 2D convolutions, using
simpler tensor descriptors with fewer transform layers to reduce address
computation overhead in the GEMM main loop
- Cherry-pick PR #6090 for fair benchmarking (cache flush, include dX
zeroing cost)

## Test Plan

- Benchmark grouped vs non-grouped kernels on MI300X (589 shapes, BF16)
- Verify correctness with existing conv bwd data tests

## Test Result

| Metric | Before | After |
|--------|--------|-------|
| Mean ratio (grouped/nongrouped) | 1.159 | **1.028** |
| Median ratio | 1.142 | **1.026** |
| Cases within 2% | 26 (4.4%) | **186 (31.8%)** |
| Cases >20% slower | 188 (32%) | **2 (0.3%)** |

NoShuffle + nongrouped-match instances achieve **~2.8% average gap**
with non-grouped kernels (down from ~16%).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: root <root@ctr-cx64-mi300x-4.amd.com>
Co-authored-by: root <root@ctr-cx71-mi300x-01.amd.com>
Co-authored-by: root <root@ctr-cx63-mi300x-21.amd.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: root <root@gt-ccs-aus-h17-18.cs-aus.dcgpu>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-18 06:49:50 -07:00
Yung-sheng Tu
3ccb72e761 [rocm-libraries] ROCm/rocm-libraries#6207 (commit cc56378)
[CK TILE] Unification Work – Add `print()` Utility to `MmaOpTraits` (#6207)

## Motivation

It would be useful to have a `print()` utility inside of unification
work's code scope, so that we can print all template params and derived
params of `amdgcn_mma` for easier debugging.

## Technical Details

Adding helper functions and struct to traits, adding `print_flags()` for
each `Default*CtrlFlags`, `amdgcn_target` and `MmaOpTraits` structs, and
adding `print()` for `amdgcn_mma`.

Note: the first commit is **not** in the scope of this PR. This PR
should be merged after https://github.com/ROCm/rocm-libraries/pull/5801
and https://github.com/ROCm/rocm-libraries/pull/5857.

## Test Plan

Adding test in layout test.

## Test Result

Test should pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-18 13:02:38 +02:00
Bartłomiej Kocot
cc5c79a1e7 [rocm-libraries] ROCm/rocm-libraries#5904 (commit f4e261a)
[CK][CK Tile]  Grouped Conv Backward Weight Streamk instances (#5904)

## Motivation

Add streamk instance to grouped convolution backward weight profiler.

## Technical Details

- New instances for grouped conv backward weight with streamk

## Test Plan

test_grouped_convnd_bwd_weight_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Graner, Johannes <johannes.graner@amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-05-16 10:49:18 +02:00
Bartłomiej Kocot
067e5e0ca4 [rocm-libraries] ROCm/rocm-libraries#6838 (commit ff7a665)
[CK_TILE] Add depthwise conv2d forward kernel (FP16/FP32) (#6838)

## Motivation

CK currently has no kernel optimized for depthwise convolution
(G=C_in=C_out, C=K=1 per group) and existing generic paths perform
poorly for this workload. This PR adds a dedicated depthwise conv
forward kernel in CK Tile.

## Technical Details

Adds a dedicated depthwise conv2d forward op to CK Tile that performs
direct convolution rather than falling back to the generic GEMM path.
The kernel is templatized by filter size, stride, and data type, and
compiled into ~60 instances covering common configurations (kernel
3/5/7/9, stride 1/2, FP16/FP32). Supports both CDNA (gfx942/gfx950) and
RDNA (gfx1100/gfx1200) architectures.

## Test Plan

- [x] Correctness and performance validated on gfx942, gfx950, and
gfx1100, with ckProfiler `grouped_conv_fwd` as baseline.
- [ ] MI300A (gfx942) and gfx1200 validation.

## Submission Checklist

- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1137

---------

Co-authored-by: GenDu <Gen.Du@amd.com>
2026-05-15 15:47:55 +02:00
Illia Silin
717f2efef7 [rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)
[CK] add composable kernel support on gfx1250 (#6978)

## Motivation

Add composable kernel support on gfx1250.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Qun Lin <qlin@amd.com>
Co-authored-by: jialuo12_amdeng <jia.luo@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
2026-05-15 06:46:51 -07:00
Illia Silin
ac18460782 [rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70)
[CK] Suppress new staging compiler errors (#7384)

## Motivation

This should make new builds with staging compiler pass.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-14 12:51:08 -07:00
Yi DING
af7118e342 [rocm-libraries] ROCm/rocm-libraries#7331 (commit 5692db0)
[CK_TILE] Add async workspace prepare to FMHA BWD launcher (#7331)

## Motivation

`aiter::mha_bwd` in group mode currently issues two synchronous
`hipMemcpy` D2H copies to read `seqstart_q/k` for launcher construction.
These sync copies block the host (~10–30 µs each) and implicitly
synchronize the device by draining the stream, breaking CPU/GPU overlap
on hot training paths.

This PR adds a fully stream-async workspace preparation path on the FMHA
BWD launcher so callers can pre-allocate the device workspace from
upper-bound shapes and stage seqstart-dependent metadata via
D2H/host-pack/H2D entirely on the user's stream.

## Technical Details

- `FmhaBwdWorkspaceManager::GetWorkspaceDeviceSizeUpperBound`
(`include/ck_tile/ops/fmha/kernel/fmha_bwd_kernel.hpp`): computes the
worst-case device dq_acc size from `(max_batch, hdim_q, nhead_q,
max_seqlen_q, max_seqlen_k)` without dereferencing any seqstart array.
Mirrors `PrepareWorkspaceHost`'s return value with worst-case bounds.
- `fmha_bwd_launcher::prepare_workspace_async`
(`example/ck_tile/01_fmha/fmha_bwd.hpp`): on the caller's stream, in
order:
  1. `hipMemsetAsync` of the dq_acc region (when `NeedsZeroDqAcc()`)
2. group mode: `hipMemcpyAsync` D2H of `seqstart_q/k` into a pinned host
staging buffer
3. `hipLaunchHostFunc` runs `PrepareWorkspaceHost` on the pinned buffer
  4. `hipMemcpyAsync` H2D of the packed metadata into `device_ws_ptr`

The pinned staging buffer is held via `std::shared_ptr<void>` returned
by a caller-provided `pinned_host_alloc` callback. Lifetime is extended
past stream completion by a tail `hipLaunchHostFunc` scheduled in the
launcher's destructor.

- `ck_tile::pinned_host_releaser`
(`include/ck_tile/host/pinned_host_releaser.hpp`): worker-thread utility
for callers using bare `hipHostMalloc`. Defers `hipHostFree` off the HIP
driver callback thread, which holds runtime locks and would deadlock
against concurrent main-thread `hipFree`. PyTorch's
`CachingHostAllocator` does not need this.

- Example runner (`example/ck_tile/01_fmha/fmha_bwd_runner.hpp`):
switched to the async path.

## Test Plan

- `tile_example_fmha_bwd` (gfx950, dev preset `-Werror -Weverything`):
  - batch + nondet / batch + det / group + nondet / group + det
- group + det 4-batch varlen (`-b=4 -h=8 -s=4096,3072,2048,1024 -d=128`)
- FA (`flash-attention`) integration on ROCm 7.1.1 + PyTorch 2.9.1:
  - `tests/test_flash_attn_ck.py::test_flash_attn_varlen_deterministic`
  - `tests/test_flash_attn_ck.py::test_flash_attn_bwd_varlen_seqq_zero`

## Test Result

- All CK runner cases `valid:y`.
- FA pytest: **1952 passed in 44.82s**.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-14 21:33:21 +08:00
Qianfeng
0c63d6e776 [rocm-libraries] ROCm/rocm-libraries#7256 (commit 1fc20eb)
Skip numeric drop-out when PComputeWindow is a null_tile_window in Bl… (#7256)

The BlockDropout implementation already provides very complete logic for
generating random numbers and executing dropout for the P tensor after
first attention Gemm with capability to support both Warp-Gemm 32x32 and
16x16 as well as to run on both wave32 and wave64 arch.

But in some situation, we only need the block-layer process to generate
random numbers, rather than simultaneously execute dropout in real-time
on the vgpr tile. For example, xformers'
`test_mem_eff_attention.py::test_dropout_ck` requires the host reference
implementation of `attention forward with dropout` to use the same
random numbers to compare & verify the device side implementation of
`attention forward with dropout`, so a standalone kernel to generate
random numbers only is required.

This PR will enable xformers's random_val generating kernel (in file
`ck_tiled_rand_uniform_kernel.h`) to depend on BlockDropout's `Run()`
operator completely to generate random numbers for a `[MPerBlock,
NPerBlock]` tile during the tile iteration, no need to replicate the
logic of BlockDropout in the xformers kernel
2026-05-13 09:41:25 +00:00
Illia Silin
22b9feb40f [rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f)
[CK] Fix latest batch of staging compiler warnings (#7111)

## Motivation

Suppress the new batch of clang lifetimebound and invalidation warnings
with the latest staging compiler.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-08 07:14:14 -07:00
Yi DING
370c7d762b [rocm-libraries] ROCm/rocm-libraries#7141 (commit 37e40c3)
[CK_TILE] Fix typo in fmha_fwd_kernel K-dram unmerge tuple sizes (#7141)

## Summary

The qr_async_trload K-dram lambda's `else (XorLengthFold == 1)` branch
in `fmha_fwd_kernel.hpp` writes the outer-tile dim of its 3-tuple
unmerge/xor/merge as

```cpp
number<FmhaPipeline::kQKHeaddim / kDramTileK / FmhaPipeline::kAlignmentK>{}
```

which divides one extra time. For every fp16/bf16 hdim=128 configuration
the outer length collapses to **0**, e.g. `128 / 128 / 8 == 0`. The
3-tuple product no longer equals `kQKHeaddim`, so unmerge → xor → merge
stops round-tripping the head dimension.

This bug was masked by the async-load path: it only walks the descriptor
via stride and silently absorbs a length=0 outer dim. Any consumer that
actually traverses the descriptor (e.g. the TDM path on gfx1250)
immediately faults on the resulting `tuple<int, constant<0>>`.

The fix drops the extra `/ kAlignmentK` in all three call sites in the
same lambda so the outer dim becomes `kQKHeaddim / kDramTileK` and the
product is restored to `kQKHeaddim`. Strides are unaffected, so the
async path is bit-identical.

| Config (fp16/bf16) | hdim | kDramTileK | kAlignmentK | a (typo) | a
(fixed) | product (typo) | product (fixed) |
|---|---|---|---|---|---|---|---|
| hdim128, kKLoadOnce  | 128 | 128 | 8 | 0 | 1 | **0** | **128** |
| hdim128, kK0=32      | 128 |  32 | 8 | 0 | 4 | **0** | **128** |
| hdim64,  kKLoadOnce  |  64 |  64 | 8 | 0 | 1 | **0** | **64**  |
| hdim256, kK0=32      | 256 |  32 | 8 | 1 | 8 | **32** | **256** |

Bug introduced in 2cc0af6a815a (PR #2888 \"[CK_TILE] FMHA FWD bug
fix\"), where the original 2-tuple unmerge was generalized to a 3-tuple
and the typo slipped in.

## Test plan

- [x] Built `test_ck_tile_fmha_fwd` (umbrella, 5 gtest binaries) on
gfx950 native at develop b3bdc63a509 with `dev-gfx950` preset (clang 22,
ROCm 7.2.2). Compiles cleanly with `-Werror -Weverything`.
- [x] Ran `ctest -R test_ck_tile_fmha_fwd` on gfx950 native, baseline vs
patched: identical pass/fail (3 pass / 2 fail), identical failing case
set (114 gtest fails + 2 GPU memory access faults, all in pre-existing
fp16/bf16 group-mode `Alibi`/`Dropout` cases that reproduce on develop
without this patch). Total wall time 403s → 393s. Per-case latency drift
±8% (noise).
- [x] CI to verify on other gfx9 / gfx11 architectures.
2026-05-08 16:50:40 +08:00
Linjun-AMD
cb61576896 [rocm-libraries] ROCm/rocm-libraries#6873 (commit b61b3fb)
[CK] add swiglustep_and_mul activation to gridwise_moe_gemm (#6873)

Title:
feat(composablekernel): add swiglustep_and_mul activation to
gridwise_moe_gemm

  Description:
  ## Motivation

Step-3.5-Flash uses a clamped SwiGLU activation (`swiglu_limits[43]=7`,
  `swiglu_limits[44]=7`) for layers 43 and 44. Without this kernel path,
  those layers produce BOS token spam because unclamped gate/up values
  accumulate floating-point noise over 200+ decode steps, degrading
  output quality (cosine similarity drops from 0.999989 to ~0.998982).

  ## Changes

  Add `swiglustep_and_mul` as a new `Activation` enum branch in
  `gridwise_moe_gemm.hpp`, covering all 4 code paths:
  - Quantized (A×B scale) + IsInputGemm=true
  - Quantized (A×B scale) + IsInputGemm=false
  - Non-quantized + IsInputGemm=true
  - Non-quantized + IsInputGemm=false

  The activation computes:
  gate = silu(gate)
  gate = clamp(gate, max=7.0f)
  up   = clamp(up,   min=-7.0f, max=7.0f)
  output = gate * up

Also handles the `MulRoutedWeight` case (topk weight multiplication) and
  `pk_i4_t` weight scaling (×16 dequant factor).

  ## Verification

  - Tested on gfx950 (MI350X, 8×GPU)
- cosine similarity for layers 43/44: **0.999989** (vs 0.998982 before
fix)
  - End-to-end Step-3.5-Flash inference: no BOS spam, output coherent
  - BF16 tp=2/tp=4 and FP8 tp=2/tp=4 all verified PASS
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-07 05:59:47 +00:00
Linjun-AMD
5a2a362c46 [rocm-libraries] ROCm/rocm-libraries#6914 (commit b791478)
[CK_TILE][FMHA] Fix sink un-mask under right-window and emit fp8bf16 batch_prefill sink kernels (#6914)

## Summary

Two related fixes to `ck_tile` FMHA so that StreamLLM-sink +
sliding-window
  batch-prefill works correctly for fp8 KV / bf16 compute.

  Review the commits in this order:

  1. `fmha: emit sink kernels for fp8bf16 batch_prefill`
Extends `example/ck_tile/01_fmha/codegen/ops/fmha_batch_prefill.py` so
     the fp8(KV) / bf16(QO) batch-prefill codegen also emits the
`mask=mask_enum::generic_with_sink` variant. Without this the runtime
     could not dispatch to a sink-aware kernel for the fp8bf16 path.

  2. `fmha: respect right-window in IsOutOfSinkBound`
The sink un-mask in `GenericAttentionMask::IsOutOfSinkBound` (local-mask
branch) used `(i_y + x) > 1` as the gate, which conditioned on the row
     index instead of the column index. As a result, queries `1..sink-1`
could attend to *future* sink positions (violating causal /
right-window),
while query `0` fell back to the plain causal mask. The fix replaces the
guard with `i_x < i_y + x` so every query only sees sink columns up to
     its own right-window boundary.

  3. `fmha: clarify IsOutOfSinkBound predicate comment`
Doc-only follow-up that rewrites the comment above the predicate as a
     clause-by-clause explanation (`i_x < sink`, `i_x < i_y + x`,
     `y < y_total`, `i_y < x_total`).

  ## Test plan

- [x] Repro on aiter `op_tests/test_batch_prefill.py` (fp8 +
bf16_dequant
        modes with `sink=4`, `win_left=1023`, `softcap=0.0`, `sal=True`)
        now passes for all parametrized shapes.
- [x] Existing fp16/bf16 batch-prefill paths (no sink) unchanged —
codegen
diff only adds the `generic_with_sink` variant for fp8bf16; existing
        kernel object lists unaffected.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: fengjunda.aml <fengjunda.aml@bytedance.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: root <root@smci350-rck-g03-f12-31.rck.dcgpu>
2026-05-07 10:39:55 +08:00
Yi DING
2c677e8471 [rocm-libraries] ROCm/rocm-libraries#6152 (commit 36b016a)
[CK_TILE] Use Unified Workspace for FMHA BWD (#6152)

## Motivation
`dq_acc` is the intermediate accumulation buffer used in FMHA backward
pass for deterministic mode. The current implementation allocates it as
a **single rectangular tensor**:

```
shape = [shape_batch, nhead, nsplits, shape_seqlen_q, hdim_q]
```

where `nsplits = launcher.dq_acc_splits` (a single scalar), computed
from `max_seqlen_k` and shared across all batches.

### Problems

1. **Memory waste**: In group mode, each batch may have a different
`seqlen_k`, but `nsplits` is computed from `max_seqlen_k`, causing
batches with shorter `seqlen_k` to over-allocate in the split dimension.

2. **Interface coupling**: `fmha_bwd_args` exposes internal layout
details such as `stride_dq_acc`, `nhead_stride_dq_acc`,
`batch_stride_dq_acc`, and `split_stride_dq_acc`. The caller is
responsible for computing these strides, but this logic belongs inside
the kernel.

### Goals

1. Switch `dq_acc` buffer to a **compact layout**: batches are
concatenated contiguously, with each batch occupying `nhead * nsplits_i
* seqq_i * hdim_q` elements (nhead outermost).
2. **Remove all `*_stride_dq_acc` fields** from `fmha_bwd_args`,
replacing them with a single `workspace_ptr`; the kernel splits this
internally using a fixed layout.
4. `fmha_bwd_launcher` provides a **workspace management interface**:
the caller only needs to allocate GPU memory and call
`prepare_workspace()` — no layout computation required.
5. **Isolate kernel internals from the caller API**: the `dq_acc` layout
(nsplits, strides, buffer size) is determined entirely inside the
launcher/kernel. Future changes to block shape, pipeline type, or
persistent kernel strategy require no modifications to the caller's
`fmha_bwd_args` or workspace allocation logic.

## Technical Details

### Interface Design

#### New fields in `fmha_bwd_traits`

```cpp
struct fmha_bwd_traits
{
    int seqlen_q;
    int seqlen_k;
    int batch;
    int max_seqlen_q;
    int max_seqlen_k;
    int hdim_q;
    int hdim_v;
    int nhead_q;
    int nhead_k;
    std::string data_type;
    bool is_group_mode;
    mask_enum mask_type;
    bias_enum bias_type;
    bool has_dbias;
    bool has_dropout;
    bool is_store_randval;
    bool is_deterministic;
    // New: cumulative physical seqlen pointers for group mode (pass nullptr for batch mode).
    // seqstart_qs[i+1] - seqstart_qs[i] = physical seqlen_q of batch i (including padding); length = batch+1
    // seqstart_ks[i+1] - seqstart_ks[i] = physical seqlen_k of batch i (including padding); length = batch+1
    const int* seqstart_qs = nullptr;
    const int* seqstart_ks = nullptr;
};
```

#### `fmha_bwd_launcher` actual structure

```cpp
struct fmha_bwd_launcher
{
    std::function<float(fmha_bwd_args, const ck_tile::stream_config&)> run{};

    // Total workspace size in bytes (host_ws_size + device_ws_size), computed by init().
    // Zero for kUseQrQtrDorPipeline (writes dq directly, no acc buffer needed).
    size_t workspace_size = 0;

    fmha_bwd_launcher(const fmha_bwd_traits&);

    // Copies auxiliary data (nsplits[], offsets[]) via hipMemcpy to the head of the GPU workspace,
    // and zeros the dq_acc buffer portion (tail of workspace) if required.
    // The memory pointed to by device_ws must be >= workspace_size bytes.
    std::function<void(void* device_ws)> prepare_workspace{};

    template <typename... Args>
    float operator()(Args&&... args) const { return run(std::forward<Args>(args)...); }

private:
    size_t host_ws_size   = 0;  // CPU workspace size (nsplits[] + offsets[] arrays)
    size_t device_ws_size = 0;  // GPU-only data size (dq_acc buffer)
    std::unique_ptr<char[]> ws_host;  // host-side workspace buffer

public:
    template <typename T0, typename T1, typename T2, typename Arch>
    void init(const fmha_bwd_traits& traits);
};
```

The `init<>()` template method (invoked by codegen dispatch branches as
`this->init<...>(t)`) is responsible for:
1. Setting the `run` lambda
2. Calling `FmhaBwdDQDKDVKernel::GetWorkspaceHostSize(batch)` to obtain
`host_ws_size`
3. Allocating `ws_host` (host memory)
4. Calling `FmhaBwdDQDKDVKernel::PrepareWorkspaceHost(ws_host.get(),
...)` to fill nsplits/offsets; return value is `device_ws_size`
5. `workspace_size = host_ws_size + device_ws_size`
6. Setting the `prepare_workspace` lambda (captures `this`, calls
`PrepareWorkspaceDevice`)

When no kernel matches the given traits, both `run` and
`prepare_workspace` are initialized to default lambdas that print a
warning to `std::cerr` and return gracefully (no exception).

#### Workspace overall layout

The workspace is managed by `FmhaBwdWorkspaceManager` and consists of
two segments:

```
Offset 0 (CPU-prepared segment, host_ws_size bytes; also hipMemcpy'd to the head of GPU workspace):
  index_t nsplits[batch or 1]       — per-batch nsplits array
                                      group mode: batch elements
                                      batch mode / non-deterministic: 1 element
  [group mode only] long_index_t dq_acc_offsets[batch+1]
                                    — per-batch element offset (inclusive prefix sum)
                                      offsets[0]=0, offsets[i+1] = offsets[i] + nhead*nsplits_i*seqq_i*hdim_q

Offset host_ws_size (device data segment, device_ws_size bytes):
  AccDataType dq_acc[total_elements] — compact dq_acc buffer (zeroed if required)
                                       total_elements = sum_i(nhead * nsplits_i * seqq_i * hdim_q)
                                       layout within each batch: [nhead, nsplits_i, seqq_i, hdim_q]
                                       note: seqq_i uses the physical length (including padding)
```

Alignment constant (`ALIGNMENT = 16`):
```
nsplits_size  = align_up(sizeof(index_t) * N, 16)          // N = batch (group) or 1 (batch/non-det)
offsets_size  = align_up(sizeof(long_index_t) * (batch+1), 16)  // group mode only
host_ws_size  = nsplits_size + offsets_size
dq_acc_offset = host_ws_size  // GetDqAccDataOffset(batch)
```

**Key benefits**:
- The kernel reads nsplits/offsets directly from the workspace head — no
device-side recomputation.
- `FmhaBwdConvertQGradKernel` is completely decoupled from the pipeline
block shape (`kN0`): nsplits is read from `nsplits_ptr`, `kN0` is no
longer a template parameter, and multiple dq_dk_dv tiles with different
`F_bn0` values now share a single convert_dq kernel instance (under
receipt 1/2, deterministic convert_dq kernel count drops from ~300 to
60).
- nsplits/offsets are computed on the host and transferred in one
`hipMemcpy`; the dq_acc buffer follows immediately, at the offset given
by `GetDqAccDataOffset`.

#### Workspace size by scenario

| Scenario | `workspace_size` | Notes |
|----------|-----------------|-------|
| **kUseQrQtrDorPipeline** (any mode) | `0` | Writes dq directly; no acc
buffer; `PrepareWorkspaceHost` returns 0 |
| **Non-deterministic + batch mode** | `> 0` | nsplits[1]=1; dq_acc used
for atomic add; `workspace_size = host_ws_size +
batch*nhead*seqlen_q*hdim_q*ebytes` |
| **Non-deterministic + group mode** | `> 0` | nsplits[1]=1; dq_acc
contiguous layout; `workspace_size = host_ws_size +
nhead*seqstart_qs[batch]*hdim_q*ebytes` |
| **Deterministic + group mode** | `> 0` | nsplits[batch],
offsets[batch+1], compact dq_acc; nsplits_i computed independently per
batch |
| **Deterministic + batch mode persistent** | `> 0` | nsplits[1]
(uniform across batches); dq_acc `batch*nhead*nsplits*seqlen_q*hdim_q` |

**NeedsZeroDqAcc** (determines whether `PrepareWorkspaceDevice` calls
`hipMemset`):
- Persistent kernel (deterministic batch mode) or non-deterministic:
**must zero** (atomic add requires zero initialization)
- Deterministic group mode + no mask: **no zeroing needed** (every tile
writes its full region)
- Deterministic + with mask: **must zero** (some blocks are skipped,
leaving uninitialized tiles that would contribute to the reduction)

#### Caller usage

```cpp
// 1. Create launcher (traits include seqstart_qs/ks pointers; workspace_size is computed during construction)
fmha_bwd_launcher launcher(fmha_traits);

// 2. Read launcher.workspace_size directly
const auto ws_size = launcher.workspace_size;

// 3. Allocate a single GPU workspace
ck_tile::DeviceMem ws_buf(ws_size);

// 4. Copy nsplits/offsets to GPU head and zero dq_acc if required
launcher.prepare_workspace(ws_buf.GetDeviceBuffer());

// 5. Build args with a single workspace pointer; the kernel splits it internally
fmha_bwd_args args{
    ...,
    ws_size > 0 ? ws_buf.GetDeviceBuffer() : nullptr,  // workspace_ptr
};
launcher(args, stream_config);
```

---

### Key Code Structure

#### FmhaBwdWorkspaceManager (`fmha_bwd_kernel.hpp`, new class)

```cpp
template <typename AccDataType, bool kIsGroupMode, bool kIsDeterministic>
struct FmhaBwdWorkspaceManager
{
    static constexpr size_t ALIGNMENT = 16;

    // CPU workspace (nsplits + offsets) sizes
    static size_t GetDqAccSplitsSize(int batch);   // align_up(sizeof(index_t)*N, 16)
    static size_t GetDqAccOffsetsSize(int batch);  // group mode only: align_up(sizeof(long_index_t)*(batch+1), 16)
    static size_t GetWorkspaceHostSize(int batch);  // = SplitsSize + OffsetsSize

    // Starting offset of dq_acc data within the full workspace (= host_ws_size)
    static size_t GetDqAccDataOffset(int batch);   // = GetWorkspaceHostSize(batch)

    // Fills nsplits/offsets in the CPU workspace; returns device_ws_size (dq_acc buffer bytes)
    template <bool kUseQrQtrDorPipeline, index_t kN0>
    static size_t PrepareWorkspaceHost(void* cpu_ws, index_t batch_size, index_t hdim_q,
                                       index_t nhead_q, index_t seqlen_q, index_t seqlen_k,
                                       const index_t* seqstart_qs, const index_t* seqstart_ks);

    // hipMemcpy's cpu_ws to device_ws head; hipMemset's the dq_acc portion to 0 if required
    template <bool kUseQrQtrDorPipeline, bool kHasMask>
    static void PrepareWorkspaceDevice(void* device_ws, const void* host_ws,
                                       size_t device_ws_size, size_t host_ws_size);
};
```

#### workspace_ptr parsing (inside the kernel)

The kernel parses three address regions from `kargs.workspace_ptr`:

**Group mode (`FmhaBwdDQDKDVKernel::MakeKargs`)**:
```cpp
const uint8_t* ws = reinterpret_cast<uint8_t*>(workspace_ptr);
// dq_acc_ptr (stored in FmhaBwdCommonKargs)
ws + WorkspaceManager::GetDqAccDataOffset(batch)
// dq_acc_batch_offset_ptr (FmhaBwdGroupModeKargs field)
reinterpret_cast<const long_index_t*>(ws + WorkspaceManager::GetDqAccOffsetsOffset(batch))
```

**Batch mode**:
```cpp
ws + WorkspaceManager::GetDqAccDataOffset(batch)  // dq_acc_ptr
// No offsets pointer; batch offset is computed inside run_() from nsplits
```

**`FmhaBwdConvertQGradKernel`** follows the same pattern:
- Group mode: extracts `dq_acc_ptr`, `dq_acc_batch_offset_ptr`, and
`nsplits_ptr` (`GetDqAccSplitsOffset(batch)`) from workspace
- Batch mode: reads nsplits from `nsplits_ptr[0]`; batch offset computed
internally

### Addressing in `run_()` (group mode)

```cpp
// Per-batch processing:
const long_index_t batch_offset_dq_acc = kargs.dq_acc_batch_offset_ptr[i_batch];
// seqq_i (physical length) derived from seqstart_q_ptr
const index_t seqq_i = kargs.seqstart_q_ptr[i_batch+1] - kargs.seqstart_q_ptr[i_batch];
// nsplits_i read from nsplits_ptr (convert_dq kernel) or from GetDqAccSplits
const long_index_t split_stride_i = static_cast<long_index_t>(seqq_i) * kargs.hdim_q;
const long_index_t nhead_stride_i = static_cast<long_index_t>(nsplits_i) * split_stride_i;
// Final address:
dq_acc_base + batch_offset_dq_acc + i_nhead * nhead_stride_i + i_split * split_stride_i
```

#### nsplits computation (`PrepareWorkspaceHost`)

`PrepareWorkspaceHost` is a template method of `FmhaBwdWorkspaceManager`
that still takes `kN0` as a template parameter (from
`BlockFmhaShape::kN0` of the dq_dk_dv pipeline). However, this parameter
is **only used inside this host-side function** to compute nsplits — it
is no longer passed into the convert_dq kernel.

| Mode | nsplits computation |
|------|---------------------|
| kUseQrQtrDorPipeline | Writes dq directly; nsplits[0]=0; returns
device_ws_size=0 |
| Non-deterministic | nsplits[0]=1; dq_acc used for atomic add |
| Deterministic + group mode | `ceil((seqstart_ks[i+1]-seqstart_ks[i]) /
kN0)` computed per batch |
| Deterministic + batch mode persistent | Same logic as the original
`GetDqAccSplits` (`dqdqkdv_workers` based) |

### Removing kN0 dependency from `FmhaBwdConvertQGradKernel`

`FmhaBwdConvertQGradKernel` previously required `kN0` as a template
parameter (via `BlockFmhaBwdConvertQGradPipelineProblem`) for two
purposes:
1. In batch mode `operator()`: self-computing `nsplits = ceil(seqlen_k /
kN0)`
2. The `b{kM0}x{kN0}` component of the kernel name string

Both have been removed in this refactor:
- **Batch mode**: now reads `kargs.nsplits_ptr[0]` directly (guarded by
`if constexpr(kIsDeterministic)` to avoid accessing a non-existent field
in non-deterministic instances)
- **Kernel name**: simplified to `b{kM0}`, no longer includes `kN0`
- **Template parameters**: `BlockFmhaBwdConvertQGradPipelineProblem`
drops the `kN0_` parameter; `fmha_bwd_convert_dq_traits_` drops the
`kN0` parameter; `F_bn0`/`convert_dq_bn0` fields removed from codegen

Effect: all dq_dk_dv tiles sharing the same `(hdim, dtype, mode, pad,
deterministic)` combination — regardless of `F_bn0` value
(16/64/128/192/256) — now share a **single** convert_dq kernel instance.

---

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-07 10:22:28 +08:00
Jeff Huang
faa9dc52cb [rocm-libraries] ROCm/rocm-libraries#6932 (commit ce3e67b)
[CK] Fix OOB page table read in batch_prefill V prefetch (AICK-1171) (#6932)

## Summary

Fix a GPU memory access fault in `mha_batch_prefill` triggered when the
per-batch page table is tightly sized (no trailing slack).

**Affected configurations:**
- All FMHA batch prefill V2 kernels
(`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`)
- Triggered by paged KV layouts where `kv_page_indices.numel() ==
ceil(seqlen_k / page_size)` exactly
- Manifests as: `Memory access fault by GPU node-X (Agent handle:
0x...)` followed by `Aborted (core dumped)`
- Silent corruption (no fault, wrong output) when the OOB read happens
to land in zero-initialized memory

### Root cause

`load_physical_pages` performs **lookahead reads** on the page table to
prefetch K/V tiles for the next iteration. When the page table for a
batch has exactly `N` entries, the V-tile prefetch indexes `page_idx[N]`
(one past the last valid entry), reading either uninitialized memory or
the next batch's slot. On gfx942 with a tightly-sized page table, the
read crosses into an unmapped page and triggers an HSA page fault.

The bug was masked in earlier testing because most test harnesses pad
`kv_page_indices` with trailing zeros — OOB reads then return `page_id =
0`, a valid in-cache page, producing silent numerical drift instead of a
fault.

### Fix design

Thread `max_page_table_idx = (seqlen_k - 1) / page_size` from the kernel
layer down to `load_physical_pages`, and clamp every page-table read
with `ck_tile::min()`. Applied to **all four code paths** in the V
prefetch:

| Branch | What it does | Clamp applied |
|--------|-------------|---------------|
| `kIsKcache` | K prefetch loop | `min(global_token_idx >>
kLog2PageSize, max_page_table_idx)` |
| V LINEAR (`page_size == 1`) | One token = one page |
`min(global_token_idx, max_page_table_idx)` |
| V crosses pages (`kVTileCrossesPages`) | Per-thread page lookup |
`min(global_token_idx >> kLog2PageSize, max_page_table_idx)` |
| V single page (lane0 broadcast) | `readfirstlane`-uniform lookup |
`min(... >> kLog2PageSize, max_page_table_idx)` |

### Key design decisions

**Mandatory parameter, not optional with a sentinel default.** An
optional `max_page_table_idx = INT32_MAX` default would let the bug
silently come back at any new callsite that forgets to pass it. Making
it mandatory forces every caller to opt in explicitly and surfaces
missed callsites at compile time.

**`seqlen_k == 0` clamps to 0** instead of underflowing `(0 - 1) /
page_size` to `-1`. The empty-batch case is rare but well-defined: clamp
every read to slot 0.

**Single computation in the kernel layer.**
`FmhaBatchPrefillWithPagedKVCacheKernel` computes `max_page_table_idx`
once per batch and forwards it through every QScale branch (PERTENSOR /
KV_BLOCKSCALE / default). All three `operator()` overloads of the
pipeline (rich, default forwarder, KV_BLOCKSCALE forwarder) take and
forward the parameter.

### Files changed

| File | Change |
|------|--------|
| `include/ck_tile/ops/fmha/kernel/fmha_batch_prefill_kernel.hpp` |
Compute `max_page_table_idx` per batch, forward to all 3 QScale branches
|
|
`include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`
| Add `max_page_table_idx` to `load_physical_pages` and 3 `operator()`
overloads; clamp page-id reads in 4 code paths |

## Test plan

- [x] AICK-1171 reproducer verified on MI-308X (gfx942)
- [x] New pytest case `test_batch_prefill_aick1171_oob_page_table_read`
in aiter, parametrized over `total_blocks ∈ {160, 164, 168, 176, 208,
256}` (matches the `crash1_r8_*` bisect family)
- [x] Full FMHA batch prefill suite on gfx942 + gfx950

## Linked issue

AICK-1171.
2026-05-05 14:28:19 +08:00
ltqin
501e7ef12a [rocm-libraries] ROCm/rocm-libraries#6574 (commit b3db057)
[CK_TILE] Add SageAttention v2 forward kernel with multi-granularity quantization (#6574)

## Summary

Add a CK_TILE forward kernel implementing [SageAttention
v2](https://arxiv.org/abs/2411.10958) — an attention algorithm that
applies multi-granularity quantization to Q/K/V before computing
attention, trading minimal accuracy loss for higher throughput on
low-precision hardware.

### Quantization design

| Tensor | Supported data types | Scale granularity options |
|--------|---------------------|--------------------------|
| Q | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp
(32 tokens), per-thread (4 tokens) |
| K | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp
(64 tokens), per-thread (16 tokens) |
| V | fp8 | per-channel (always) |
| O | bf16 | — |

Three precision combinations are supported: `fp8/bf16` (QKV fp8, O
bf16), `i8/fp8/bf16` (QK int8, V fp8, O bf16), and `i4/fp8/bf16` (QK
int4, V fp8, O bf16).

### Architecture support

- **gfx9** (CDNA2/3, e.g. gfx90a, gfx942) — full tile set
- **gfx950** (CDNA4) — restricted tile set (N-per-block capped at 64 for
fp8-family dtypes)

### Implementation

- Two pipeline variants: `QRKSVS` (synchronous) and `QRKSVS_ASYNC`
(async copy)
- Masking support: no mask, causal (top-left / bottom-right), and
generic windowed
- Batch and group (variable-length) modes
- Head dimension: d=128, d_v=128
- Python codegen under `example/ck_tile/49_sageattention/codegen/`
generates kernel instances per target/dtype/tile combination
- Smoke tests included via `tile_example_sageattn_fwd`

### Test commands

\`\`\`bash
# fp8 QKV
./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128
-kname=1 -prec=fp8bf16 -qscale=3 -init=3

# int8 QK, fp8 V
./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128
-kname=1 -prec=i8fp8bf16 -qscale=3 -init=3
\`\`\`

\`-qscale\` values: 1=per-tensor, 2=per-block, 3=per-warp, 4=per-thread
2026-04-30 11:32:23 -07:00
Wojciech Laskowski
640bd560ec [rocm-libraries] ROCm/rocm-libraries#5801 (commit 27f6d15)
[CK Tile] Adding WMMA wrappers for dense builtins (#5801)

## Motivation

This PR is part of the [WMMA/MFMA] unification work. It's the first of
the series of PRs that add all the necessary MMA builtins as a
`amdgcn_mma` structs.

## Technical Details

This change adds new specializations for WMMA dense builtins. In total,
we have now 9 RDNA4 builtins and 3 RDNA3 builtins.

## Test Plan

All the new wrappers were added to the test suite in
`test_amdgcn_mma_layout.inc`.

## Test Result

Test pass locally, waiting for the CI.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Yung-sheng Tu <yung-sheng@streamhpc.com>
2026-04-27 11:57:51 +00:00
Qianfeng
0e6a514e4f [rocm-libraries] ROCm/rocm-libraries#6209 (commit 89c9f3e)
Improve the performance of qr_ks_vs_whole_k_prefetch pipeline (#6209)

## About qr_ks_vs_whole_k_prefetch pipeline
This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to
improve performance on both MI350 GPUs through better MFMA instruction
usage, transposed V-loading support, and N0-loop implementation. The
pipeline targets scenarios where the number of workgroups is low,
enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs
128) while prefetching entire K tiles.

## Changes:

- Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload)
to avoid using shuffle instructions on MI350
- Implements N0-loop based Gemm0 to reduce tile window movement overhead
and eliminate `clear_tile` calls
- Adds full support for hdim96/hdim160 without padding requirements
- Updates MFMA instruction selection to ensure optimal choices for MI350

## Performance results

1. For attention shapes which leads to kM0=64,
`qr_ks_vs_async_whole_k_prefetch_trload` shows much better performance
than `qr_ks_vs_async_trload` on the same case (execution time `41.02ms`
by whole_k_prefetch_trload & `58.50ms` by async_load), and
`qr_ks_vs_async_whole_k_prefetch_trload` also shows obviously better
performance than the recently tuned `qr_ks_vs_async` on the same case
(execution time `41.02ms` by whole_k_prefetch_trload 7 `47.60ms` by
qr_ks_vs_async)
2. Also on MI300, for attention shapes which leads to kM0=64,
`qr_ks_vs_async_whole_k_prefetch` shows much better performance than the
`qr_ks_vs_async` (which is supposed to be very high-efficient) on the
same case (execution time `64.50ms` by whole_k_prefetch & `80.20ms` by
qr_ks_vs_async)
3. For attention shapes which leads to kM0=128,
`qr_ks_vs_async_whole_k_prefetch_trload` show a little bit better
performance than `qr_ks_vs_async` on mi350 (execution time `104.50ms` by
whole_k_prefetch_trload & `106.50ms` by qr_ks_vs_async). And they shows
completely on-par performance on MI300

## Test/Verify

1. Use the ROCM xformers branch `test_whole_k_prefetch_n0loop` to
test/verify qr_ks_vs_whole_k_prefetch pipeline since this pipeline can
not be used by ck_tile fmha example so far
2.  Use the following command-line for building/testing xformers
>```bash
> #> git clone -b test_whole_k_prefetch_n0loop
https://github.com/ROCm/xformers
> #> git submodule update --init --recursive
> #> pip  install --no-build-isolation -e ./
> #> pytest tests/test_mem_eff_attention.py::test_forward
>```
4. Any scripts which can run on xformers can be used to evaluate
qr_ks_vs_whole_k_prefetch pipeline. Using the two environ variable to
switch from using different pipelines
> ```bash
> #> export FMHA_DISABLE_SPECIAL_TREATMENT=1 #> to disable using FAV3
and qr_ks_vs_async_trload pipeline
> #> export FMHA_ENABLE_ASYNC_PIPELINE=1 #> to disable using
qr_ks_vs_async pipeline for comparing
> ```

## Discussion

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: poyenc <1132573+poyenc@users.noreply.github.com>
Co-authored-by: qianfengz <12429178+qianfengz@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-04-24 10:30:41 -06:00