186 Commits

Author SHA1 Message Date
Yi DING
41064d8684 [rocm-libraries] ROCm/rocm-libraries#7141 (commit 37e40c3)
[CK_TILE] Fix typo in fmha_fwd_kernel K-dram unmerge tuple
 sizes (#7141)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

The qr_async_trload K-dram lambda's `else (XorLengthFold == 1)` branch
in `fmha_fwd_kernel.hpp` writes the outer-tile dim of its 3-tuple
unmerge/xor/merge as

```cpp
number<FmhaPipeline::kQKHeaddim / kDramTileK / FmhaPipeline::kAlignmentK>{}
```

which divides one extra time. For every fp16/bf16 hdim=128 configuration
the outer length collapses to **0**, e.g. `128 / 128 / 8 == 0`. The
3-tuple product no longer equals `kQKHeaddim`, so unmerge → xor → merge
stops round-tripping the head dimension.

This bug was masked by the async-load path: it only walks the descriptor
via stride and silently absorbs a length=0 outer dim. Any consumer that
actually traverses the descriptor (e.g. the TDM path on gfx1250)
immediately faults on the resulting `tuple<int, constant<0>>`.

The fix drops the extra `/ kAlignmentK` in all three call sites in the
same lambda so the outer dim becomes `kQKHeaddim / kDramTileK` and the
product is restored to `kQKHeaddim`. Strides are unaffected, so the
async path is bit-identical.

| Config (fp16/bf16) | hdim | kDramTileK | kAlignmentK | a (typo) | a
(fixed) | product (typo) | product (fixed) |
|---|---|---|---|---|---|---|---|
| hdim128, kKLoadOnce  | 128 | 128 | 8 | 0 | 1 | **0** | **128** |
| hdim128, kK0=32      | 128 |  32 | 8 | 0 | 4 | **0** | **128** |
| hdim64,  kKLoadOnce  |  64 |  64 | 8 | 0 | 1 | **0** | **64**  |
| hdim256, kK0=32      | 256 |  32 | 8 | 1 | 8 | **32** | **256** |

Bug introduced in 2cc0af6a815a (PR #2888 \"[CK_TILE] FMHA FWD bug
fix\"), where the original 2-tuple unmerge was generalized to a 3-tuple
and the typo slipped in.

## Test plan

- [x] Built `test_ck_tile_fmha_fwd` (umbrella, 5 gtest binaries) on
gfx950 native at develop b3bdc63a509 with `dev-gfx950` preset (clang 22,
ROCm 7.2.2). Compiles cleanly with `-Werror -Weverything`.
- [x] Ran `ctest -R test_ck_tile_fmha_fwd` on gfx950 native, baseline vs
patched: identical pass/fail (3 pass / 2 fail), identical failing case
set (114 gtest fails + 2 GPU memory access faults, all in pre-existing
fp16/bf16 group-mode `Alibi`/`Dropout` cases that reproduce on develop
without this patch). Total wall time 403s → 393s. Per-case latency drift
±8% (noise).
- [x] CI to verify on other gfx9 / gfx11 architectures.
2026-05-08 08:51:33 +00:00
Linjun-AMD
33b62ed087 [rocm-libraries] ROCm/rocm-libraries#6914 (commit b791478)
[CK_TILE][FMHA] Fix sink un-mask under right-window and emit
 fp8bf16 batch_prefill sink kernels (#6914)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Two related fixes to `ck_tile` FMHA so that StreamLLM-sink +
sliding-window
  batch-prefill works correctly for fp8 KV / bf16 compute.

  Review the commits in this order:

  1. `fmha: emit sink kernels for fp8bf16 batch_prefill`
Extends `example/ck_tile/01_fmha/codegen/ops/fmha_batch_prefill.py` so
     the fp8(KV) / bf16(QO) batch-prefill codegen also emits the
`mask=mask_enum::generic_with_sink` variant. Without this the runtime
     could not dispatch to a sink-aware kernel for the fp8bf16 path.

  2. `fmha: respect right-window in IsOutOfSinkBound`
The sink un-mask in `GenericAttentionMask::IsOutOfSinkBound` (local-mask
branch) used `(i_y + x) > 1` as the gate, which conditioned on the row
     index instead of the column index. As a result, queries `1..sink-1`
could attend to *future* sink positions (violating causal /
right-window),
while query `0` fell back to the plain causal mask. The fix replaces the
guard with `i_x < i_y + x` so every query only sees sink columns up to
     its own right-window boundary.

  3. `fmha: clarify IsOutOfSinkBound predicate comment`
Doc-only follow-up that rewrites the comment above the predicate as a
     clause-by-clause explanation (`i_x < sink`, `i_x < i_y + x`,
     `y < y_total`, `i_y < x_total`).

  ## Test plan

- [x] Repro on aiter `op_tests/test_batch_prefill.py` (fp8 +
bf16_dequant
        modes with `sink=4`, `win_left=1023`, `softcap=0.0`, `sal=True`)
        now passes for all parametrized shapes.
- [x] Existing fp16/bf16 batch-prefill paths (no sink) unchanged —
codegen
diff only adds the `generic_with_sink` variant for fp8bf16; existing
        kernel object lists unaffected.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-07 02:40:45 +00:00
Yi DING
207a95d5e4 [rocm-libraries] ROCm/rocm-libraries#6152 (commit 36b016a)
[CK_TILE] Use Unified Workspace for FMHA BWD
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation
`dq_acc` is the intermediate accumulation buffer used in FMHA backward
pass for deterministic mode. The current implementation allocates it as
a **single rectangular tensor**:

```
shape = [shape_batch, nhead, nsplits, shape_seqlen_q, hdim_q]
```

where `nsplits = launcher.dq_acc_splits` (a single scalar), computed
from `max_seqlen_k` and shared across all batches.

### Problems

1. **Memory waste**: In group mode, each batch may have a different
`seqlen_k`, but `nsplits` is computed from `max_seqlen_k`, causing
batches with shorter `seqlen_k` to over-allocate in the split dimension.

2. **Interface coupling**: `fmha_bwd_args` exposes internal layout
details such as `stride_dq_acc`, `nhead_stride_dq_acc`,
`batch_stride_dq_acc`, and `split_stride_dq_acc`. The caller is
responsible for computing these strides, but this logic belongs inside
the kernel.

### Goals

1. Switch `dq_acc` buffer to a **compact layout**: batches are
concatenated contiguously, with each batch occupying `nhead * nsplits_i
* seqq_i * hdim_q` elements (nhead outermost).
2. **Remove all `*_stride_dq_acc` fields** from `fmha_bwd_args`,
replacing them with a single `workspace_ptr`; the kernel splits this
internally using a fixed layout.
4. `fmha_bwd_launcher` provides a **workspace management interface**:
the caller only needs to allocate GPU memory and call
`prepare_workspace()` — no layout computation required.
5. **Isolate kernel internals from the caller API**: the `dq_acc` layout
(nsplits, strides, buffer size) is determined entirely inside the
launcher/kernel. Future changes to block shape, pipeline type, or
persistent kernel strategy require no modifications to the caller's
`fmha_bwd_args` or workspace allocation logic.

## Technical Details

### Interface Design

#### New fields in `fmha_bwd_traits`

```cpp
struct fmha_bwd_traits
{
    int seqlen_q;
    int seqlen_k;
    int batch;
    int max_seqlen_q;
    int max_seqlen_k;
    int hdim_q;
    int hdim_v;
    int nhead_q;
    int nhead_k;
    std::string data_type;
    bool is_group_mode;
    mask_enum mask_type;
    bias_enum bias_type;
    bool has_dbias;
    bool has_dropout;
    bool is_store_randval;
    bool is_deterministic;
    // New: cumulative physical seqlen pointers for group mode (pass nullptr for batch mode).
    // seqstart_qs[i+1] - seqstart_qs[i] = physical seqlen_q of batch i (including padding); length = batch+1
    // seqstart_ks[i+1] - seqstart_ks[i] = physical seqlen_k of batch i (including padding); length = batch+1
    const int* seqstart_qs = nullptr;
    const int* seqstart_ks = nullptr;
};
```

#### `fmha_bwd_launcher` actual structure

```cpp
struct fmha_bwd_launcher
{
    std::function<float(fmha_bwd_args, const ck_tile::stream_config&)> run{};

    // Total workspace size in bytes (host_ws_size + device_ws_size), computed by init().
    // Zero for kUseQrQtrDorPipeline (writes dq directly, no acc buffer needed).
    size_t workspace_size = 0;

    fmha_bwd_launcher(const fmha_bwd_traits&);

    // Copies auxiliary data (nsplits[], offsets[]) via hipMemcpy to the head of the GPU workspace,
    // and zeros the dq_acc buffer portion (tail of workspace) if required.
    // The memory pointed to by device_ws must be >= workspace_size bytes.
    std::function<void(void* device_ws)> prepare_workspace{};

    template <typename... Args>
    float operator()(Args&&... args) const { return run(std::forward<Args>(args)...); }

private:
    size_t host_ws_size   = 0;  // CPU workspace size (nsplits[] + offsets[] arrays)
    size_t device_ws_size = 0;  // GPU-only data size (dq_acc buffer)
    std::unique_ptr<char[]> ws_host;  // host-side workspace buffer

public:
    template <typename T0, typename T1, typename T2, typename Arch>
    void init(const fmha_bwd_traits& traits);
};
```

The `init<>()` template method (invoked by codegen dispatch branches as
`this->init<...>(t)`) is responsible for:
1. Setting the `run` lambda
2. Calling `FmhaBwdDQDKDVKernel::GetWorkspaceHostSize(batch)` to obtain
`host_ws_size`
3. Allocating `ws_host` (host memory)
4. Calling `FmhaBwdDQDKDVKernel::PrepareWorkspaceHost(ws_host.get(),
...)` to fill nsplits/offsets; return value is `device_ws_size`
5. `workspace_size = host_ws_size + device_ws_size`
6. Setting the `prepare_workspace` lambda (captures `this`, calls
`PrepareWorkspaceDevice`)

When no kernel matches the given traits, both `run` and
`prepare_workspace` are initialized to default lambdas that print a
warning to `std::cerr` and return gracefully (no exception).

#### Workspace overall layout

The workspace is managed by `FmhaBwdWorkspaceManager` and consists of
two segments:

```
Offset 0 (CPU-prepared segment, host_ws_size bytes; also hipMemcpy'd to the head of GPU workspace):
  index_t nsplits[batch or 1]       — per-batch nsplits array
                                      group mode: batch elements
                                      batch mode / non-deterministic: 1 element
  [group mode only] long_index_t dq_acc_offsets[batch+1]
                                    — per-batch element offset (inclusive prefix sum)
                                      offsets[0]=0, offsets[i+1] = offsets[i] + nhead*nsplits_i*seqq_i*hdim_q

Offset host_ws_size (device data segment, device_ws_size bytes):
  AccDataType dq_acc[total_elements] — compact dq_acc buffer (zeroed if required)
                                       total_elements = sum_i(nhead * nsplits_i * seqq_i * hdim_q)
                                       layout within each batch: [nhead, nsplits_i, seqq_i, hdim_q]
                                       note: seqq_i uses the physical length (including padding)
```

Alignment constant (`ALIGNMENT = 16`):
```
nsplits_size  = align_up(sizeof(index_t) * N, 16)          // N = batch (group) or 1 (batch/non-det)
offsets_size  = align_up(sizeof(long_index_t) * (batch+1), 16)  // group mode only
host_ws_size  = nsplits_size + offsets_size
dq_acc_offset = host_ws_size  // GetDqAccDataOffset(batch)
```

**Key benefits**:
- The kernel reads nsplits/offsets directly from the workspace head — no
device-side recomputation.
- `FmhaBwdConvertQGradKernel` is completely decoupled from the pipeline
block shape (`kN0`): nsplits is read from `nsplits_ptr`, `kN0` is no
longer a template parameter, and multiple dq_dk_dv tiles with different
`F_bn0` values now share a single convert_dq kernel instance (under
receipt 1/2, deterministic convert_dq kernel count drops from ~300 to
60).
- nsplits/offsets are computed on the host and transferred in one
`hipMemcpy`; the dq_acc buffer follows immediately, at the offset given
by `GetDqAccDataOffset`.

#### Workspace size by scenario

| Scenario | `workspace_size` | Notes |
|----------|-----------------|-------|
| **kUseQrQtrDorPipeline** (any mode) | `0` | Writes dq directly; no acc
buffer; `PrepareWorkspaceHost` returns 0 |
| **Non-deterministic + batch mode** | `> 0` | nsplits[1]=1; dq_acc used
for atomic add; `workspace_size = host_ws_size +
batch*nhead*seqlen_q*hdim_q*ebytes` |
| **Non-deterministic + group mode** | `> 0` | nsplits[1]=1; dq_acc
contiguous layout; `workspace_size = host_ws_size +
nhead*seqstart_qs[batch]*hdim_q*ebytes` |
| **Deterministic + group mode** | `> 0` | nsplits[batch],
offsets[batch+1], compact dq_acc; nsplits_i computed independently per
batch |
| **Deterministic + batch mode persistent** | `> 0` | nsplits[1]
(uniform across batches); dq_acc `batch*nhead*nsplits*seqlen_q*hdim_q` |

**NeedsZeroDqAcc** (determines whether `PrepareWorkspaceDevice` calls
`hipMemset`):
- Persistent kernel (deterministic batch mode) or non-deterministic:
**must zero** (atomic add requires zero initialization)
- Deterministic group mode + no mask: **no zeroing needed** (every tile
writes its full region)
- Deterministic + with mask: **must zero** (some blocks are skipped,
leaving uninitialized tiles that would contribute to the reduction)

#### Caller usage

```cpp
// 1. Create launcher (traits include seqstart_qs/ks pointers; workspace_size is computed during construction)
fmha_bwd_launcher launcher(fmha_traits);

// 2. Read launcher.workspace_size directly
const auto ws_size = launcher.workspace_size;

// 3. Allocate a single GPU workspace
ck_tile::DeviceMem ws_buf(ws_size);

// 4. Copy nsplits/offsets to GPU head and zero dq_acc if required
launcher.prepare_workspace(ws_buf.GetDeviceBuffer());

// 5. Build args with a single workspace pointer; the kernel splits it internally
fmha_bwd_args args{
    ...,
    ws_size > 0 ? ws_buf.GetDeviceBuffer() : nullptr,  // workspace_ptr
};
launcher(args, stream_config);
```
2026-05-07 02:23:28 +00:00
Jeff Huang
10cb6916c3 [rocm-libraries] ROCm/rocm-libraries#6932 (commit ce3e67b)
[CK] Fix OOB page table read in batch_prefill V prefetch
 (AICK-1171) (#6932)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Fix a GPU memory access fault in `mha_batch_prefill` triggered when the
per-batch page table is tightly sized (no trailing slack).

**Affected configurations:**
- All FMHA batch prefill V2 kernels
(`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`)
- Triggered by paged KV layouts where `kv_page_indices.numel() ==
ceil(seqlen_k / page_size)` exactly
- Manifests as: `Memory access fault by GPU node-X (Agent handle:
0x...)` followed by `Aborted (core dumped)`
- Silent corruption (no fault, wrong output) when the OOB read happens
to land in zero-initialized memory

### Root cause

`load_physical_pages` performs **lookahead reads** on the page table to
prefetch K/V tiles for the next iteration. When the page table for a
batch has exactly `N` entries, the V-tile prefetch indexes `page_idx[N]`
(one past the last valid entry), reading either uninitialized memory or
the next batch's slot. On gfx942 with a tightly-sized page table, the
read crosses into an unmapped page and triggers an HSA page fault.

The bug was masked in earlier testing because most test harnesses pad
`kv_page_indices` with trailing zeros — OOB reads then return `page_id =
0`, a valid in-cache page, producing silent numerical drift instead of a
fault.

### Fix design

Thread `max_page_table_idx = (seqlen_k - 1) / page_size` from the kernel
layer down to `load_physical_pages`, and clamp every page-table read
with `ck_tile::min()`. Applied to **all four code paths** in the V
prefetch:

| Branch | What it does | Clamp applied |
|--------|-------------|---------------|
| `kIsKcache` | K prefetch loop | `min(global_token_idx >>
kLog2PageSize, max_page_table_idx)` |
| V LINEAR (`page_size == 1`) | One token = one page |
`min(global_token_idx, max_page_table_idx)` |
| V crosses pages (`kVTileCrossesPages`) | Per-thread page lookup |
`min(global_token_idx >> kLog2PageSize, max_page_table_idx)` |
| V single page (lane0 broadcast) | `readfirstlane`-uniform lookup |
`min(... >> kLog2PageSize, max_page_table_idx)` |

### Key design decisions

**Mandatory parameter, not optional with a sentinel default.** An
optional `max_page_table_idx = INT32_MAX` default would let the bug
silently come back at any new callsite that forgets to pass it. Making
it mandatory forces every caller to opt in explicitly and surfaces
missed callsites at compile time.

**`seqlen_k == 0` clamps to 0** instead of underflowing `(0 - 1) /
page_size` to `-1`. The empty-batch case is rare but well-defined: clamp
every read to slot 0.

**Single computation in the kernel layer.**
`FmhaBatchPrefillWithPagedKVCacheKernel` computes `max_page_table_idx`
once per batch and forwards it through every QScale branch (PERTENSOR /
KV_BLOCKSCALE / default). All three `operator()` overloads of the
pipeline (rich, default forwarder, KV_BLOCKSCALE forwarder) take and
forward the parameter.

### Files changed

| File | Change |
|------|--------|
| `include/ck_tile/ops/fmha/kernel/fmha_batch_prefill_kernel.hpp` |
Compute `max_page_table_idx` per batch, forward to all 3 QScale branches
|
|
`include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`
| Add `max_page_table_idx` to `load_physical_pages` and 3 `operator()`
overloads; clamp page-id reads in 4 code paths |

## Test plan

- [x] AICK-1171 reproducer verified on MI-308X (gfx942)
- [x] New pytest case `test_batch_prefill_aick1171_oob_page_table_read`
in aiter, parametrized over `total_blocks ∈ {160, 164, 168, 176, 208,
256}` (matches the `crash1_r8_*` bisect family)
- [x] Full FMHA batch prefill suite on gfx942 + gfx950

## Linked issue

AICK-1171.
2026-05-05 06:29:51 +00:00
Qianfeng
865ab2b8ed [rocm-libraries] ROCm/rocm-libraries#6209 (commit 89c9f3e)
Improve the performance of qr_ks_vs_whole_k_prefetch pipeline
 (#6209)

## About qr_ks_vs_whole_k_prefetch pipeline
This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to
improve performance on both MI350 GPUs through better MFMA instruction
usage, transposed V-loading support, and N0-loop implementation. The
pipeline targets scenarios where the number of workgroups is low,
enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs
128) while prefetching entire K tiles.

## Changes:

- Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload)
to avoid using shuffle instructions on MI350
- Implements N0-loop based Gemm0 to reduce tile window movement overhead
and eliminate `clear_tile` calls
- Adds full support for hdim96/hdim160 without padding requirements
- Updates MFMA instruction selection to ensure optimal choices for MI350

## Performance results

1. For attention shapes which leads to kM0=64,
`qr_ks_vs_async_whole_k_prefetch_trload` shows much better performance
than `qr_ks_vs_async_trload` on the same case (execution time `41.02ms`
by whole_k_prefetch_trload & `58.50ms` by async_load), and
`qr_ks_vs_async_whole_k_prefetch_trload` also shows obviously better
performance than the recently tuned `qr_ks_vs_async` on the same case
(execution time `41.02ms` by whole_k_prefetch_trload 7 `47.60ms` by
qr_ks_vs_async)
2. Also on MI300, for attention shapes which leads to kM0=64,
`qr_ks_vs_async_whole_k_prefetch` shows much better performance than the
`qr_ks_vs_async` (which is supposed to be very high-efficient) on the
same case (execution time `64.50ms` by whole_k_prefetch & `80.20ms` by
qr_ks_vs_async)
3. For attention shapes which leads to kM0=128,
`qr_ks_vs_async_whole_k_prefetch_trload` show a little bit better
performance than `qr_ks_vs_async` on mi350 (execution time `104.50ms` by
whole_k_prefetch_trload & `106.50ms` by qr_ks_vs_async). And they shows
completely on-par performance on MI300

## Test/Verify

1. Use the ROCM xformers branch `test_whole_k_prefetch_n0loop` to
test/verify qr_ks_vs_whole_k_prefetch pipeline since this pipeline can
not be used by ck_tile fmha example so far
2.  Use the following command-line for building/testing xformers
>```bash
> #> git clone -b test_whole_k_prefetch_n0loop
https://github.com/ROCm/xformers
> #> git submodule update --init --recursive
> #> pip  install --no-build-isolation -e ./
> #> pytest tests/test_mem_eff_attention.py::test_forward
>```
4. Any scripts which can run on xformers can be used to evaluate
qr_ks_vs_whole_k_prefetch pipeline. Using the two environ variable to
switch from using different pipelines
> ```bash
> #> export FMHA_DISABLE_SPECIAL_TREATMENT=1 #> to disable using FAV3
and qr_ks_vs_async_trload pipeline
> #> export FMHA_ENABLE_ASYNC_PIPELINE=1 #> to disable using
qr_ks_vs_async pipeline for comparing
> ```

## Discussion
2026-04-24 16:31:59 +00:00
Jeff Huang
fdf4bb7fcc [rocm-libraries] ROCm/rocm-libraries#6653 (commit 1df887e)
[CK_TILE] fix(fmha): support >2GB KV cache in batch prefill
 via template dispatch (#6653)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

The CK batch prefill kernel previously failed (silent overflow + page
faults) when the KV cache exceeded 2 GB, blocking long-context inference
workloads (e.g., 128K+ token contexts with paged KV).

Two distinct failure modes were addressed:

1. **>4GB SRD overflow (`page_size < kN0`):** The SRD
`buffer_load_dwordx4` path uses a 32-bit `voffset` register; for small
page sizes the rebased SRD spans the full KV pool and the offset wraps
past 2 GB, corrupting K/V loads.
2. **gfx950 page-table fault (`page_size >= kN0`):** On CDNA4 the
hardware validates the **full SRD `num_records` range** against
page-table permissions (CDNA3 only checks per-instruction `voffset`).
After per-tile SRD rebase, an un-trimmed `num_records` field extends
past the live page and faults on freed/protected memory.

## Technical Details

**Two-mode `tile_scatter_gather` selected by the `kUseGlobalLoad`
template parameter:**

| Case | `page_size` | KV cache size | Mode | Load path | Addressing |
|---|---|---|---|---|---|
| 1 | `>= kN0` (large pages) | any | SRD (`kUseGlobalLoad=false`) |
`buffer_load_dwordx4` | 32-bit `voffset`, bounded by per-page rebase |
| 2 | `< kN0` (small pages) | `<= 2 GB` | SRD (`kUseGlobalLoad=false`) |
`buffer_load_dwordx4` | 32-bit `voffset`, fits in INT32 byte range |
| 3 | `< kN0` (small pages) | `> 2 GB` | Global-load
(`kUseGlobalLoad=true`) | `async_load_tile_raw_flat` (K) +
`load_tile_flat` (V) | 64-bit |

**Dispatch:** the auto-gen API layer (`fmha_batch_prefill.py`) selects
the kernel instantiation at launch from `(page_block_size,
num_total_pages * batch_stride_k * kElementBytes)`, so the small-page
penalty is paid only when correctness requires it.

**gfx950 SRD `num_records` trimming:** in the K and V rebase lambdas of
`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`,
`set_bottom_tensor_view_buffer_size(page_stride_k/v)` is called after
each rebase to constrain `num_records` to the live page. Required for
CDNA4 page-table validation; harmless on CDNA3.

**Pipeline sync for the global-load path:**
- V uses synchronous `load_tile_flat`; K uses
`async_load_tile_raw_flat`.
- `v_physical_pages_current` is double-buffered so the V flat load
doesn't race against the next iteration's K rebase computation.

**Arch guards:** `global_load_lds` intrinsics are gated to `__gfx94__` /
`__gfx950__` (CDNA3+). Other architectures hit a `dependent_false`
static_assert with a descriptive message.

**Device-side assertion convention:** SRD setters use
`__builtin_assume(cond)` (hint-only) rather than `<cassert>`'s
`assert()`. The latter introduces an `__assert_fail` call whose register
pressure scatters the K-SRD scalar register window across conditional
branches, corrupting `buffer_load_dwordx4` on gfx950.

## Test Plan

Tested on both MI308 (gfx942) and MI355 (gfx950) via the aiter wrapper
test suite. All coverage lives in **`op_tests/test_batch_prefill.py`**:

- **Functional matrix (96 cases)** — `test_batch_prefill`: `page_size ∈
{1, 16, 1024}` × `kv_layout ∈ {linear, vectorized}` × `dtype ∈ {bf16,
fp8 quant variants}` × `causal` × `soft_cap` × `LSE` × `batch_size ∈ {1,
4}` (parametrized to exercise per-sequence SRD rebase across batch
boundaries).
- **>2 GB coverage** — `test_batch_prefill_large_kvcache`: extended to
allocate a 5 GB+ KV cache pool and exercise both `kUseGlobalLoad=true`
(small-page) and `kUseGlobalLoad=false` (large-page rebase) paths.
Includes both single-batch and multi-batch (`batch_size=4`) cases to
exercise per-sequence SRD rebase across the >2 GB pool.
- Numerical reference: PyTorch SDPA, per-batch loop with `atol` / `rtol`
from the existing batch prefill test harness.

## Test Result

| Arch | `test_batch_prefill` | `test_batch_prefill_large_kvcache` (>2
GB) |
|------|----------------------|---------------------|
| MI308 (gfx942) | All passed | Passed |
| MI355 (gfx950) | All passed | Passed |

**Performance impact (gfx950, hot SRD path):**
- +2.67% kernel-time on `seqlen=1024 / page_sz=1024 / bf16 / sglang /
causal / soft_cap=30`, attributable in full to the two
`set_bottom_tensor_view_buffer_size` calls in the K/V rebase lambdas
(5-run median, signal/noise ≈ 9×).
- This cost is **mandatory for gfx950 correctness** on >2 GB workloads —
removing the setters re-introduces page-faults.
- gfx942: 0 regressions in the same range (all configs ≤ +0.97%).

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-23 23:09:25 +00:00
Linjun-AMD
d22aafb48b [rocm-libraries] ROCm/rocm-libraries#6479 (commit 0705c2d)
CK][fmha] Add StreamLLM sink support to batch_prefill
 pipeline (#6479)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

The existing paged-KV attention pipelines (pagedkv, splitkv) support
  StreamLLM-style sink tokens — a fixed set of initial tokens kept in
  attention alongside the sliding window. The `batch_prefill` pipeline
  (chunked-prefill with VLLM-style block tables) previously hardcoded
  `kHasSink = false`, making it incompatible with sink-based attention
  patterns in LLM serving scenarios.

  This PR extends `batch_prefill` to support `kHasSink` and wires it
into `fmha_fwd_runner` for validation against the existing CPU
reference.

## Technical Details

 **Pipeline** (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`):
- When `kHasSink`, the K/V loop splits into a sink phase [0,
sink_seq_end)
and a window phase [seqlen_k_start, seqlen_k_end), mirroring pagedkv.
  - K advance at the sink→window transition jumps
    `seqlen_k_start - sink_seq_end + kN0` to bridge the gap.
- V scatter-gather offsets are re-initialized at the transition to fix a
window mismatch bug: V was lagging kN0 behind K after the large jump,
    loading from the wrong sequence position.
- Bias window, dropout seq_offset, and mask type (LogitsSinkMask)
updated
    for sink-awareness.

**Traits / codegen** (`tile_fmha_traits.hpp`, `fmha_fwd.hpp`,
`fmha_batch_prefill.py`):
- `TileFmhaBatchPrefillTraits` gains `kHasSink_` (was hardcoded
`false`).
- Codegen adds `F_sink` field; skips batch-mode kernels (group mode
required).
  - CMake test filter broadened from 9 → 33 instances covering
    fp16/bf16 × mask/nmask × lse/nlse × sink/nsink.

  **Runner** (`fmha_fwd_runner.hpp`, `CMakeLists.txt`):
  - `fmha_batch_prefill()` dispatched from `run_fwd` when:
    group mode + paged KV + num_splits == 1.
- K/V strides corrected for runner's [num_pages, nhead_k,
page_block_size, hdim] layout.
  - `page_block_size % 128` check relaxed: batch_prefill supports ps=16.
  - CPU reference paged-KV reordering guards extended with
    `CK_TILE_FMHA_FWD_BATCH_PREFILL_API`.

## Test Plan

Build with `-DFMHA_FWD_ENABLE_APIS="fwd;batch_prefill"`, run
  `tile_example_fmha_fwd` in group mode with page_block_size=16.

  Test matrix:
  - Mask: no-mask, causal, sliding window
  - Sink: nsink, sink=1..128
  - dtype: fp16, bf16
  - LSE output: on/off
  - seqlen ∈ {512,1024,2048,4096} × window ∈ {32,256,512,1024}
  - GQA, chunked prefill, large batch×seqlen
  - page_block_size: 16, 32

## Test Result

171 test cases, all valid:y:
  - nmask + nsink: ✓
  - causal + nsink: ✓
  - causal + sink=8: ✓
  - sliding window + sink=8 (d=128, d=256): ✓
  - bf16, LSE output, GQA: ✓

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-21 11:05:12 +00:00
Aviral Goel
e0dfe58d66 [rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8)
CK: Remove 41 commented-out dead code blocks (~200 lines)
 (#6302)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Depends on #6300

## Summary

Remove 41 commented-out code blocks across 33 files in Composable
Kernel, totaling ~200 lines.

Identified using an automated dead code scanning skill (`ck-dead-code`)
with a calibrated two-stage pipeline:
1. **Pre-filter**: Keyword-based scan found 1,338 `//`-commented blocks.
Calibrated heuristics (trained on 50-sample expert classification)
reduced to 89 high-confidence candidates — 93% noise reduction.
2. **Expert triage**: LLM expert classified each block in context as
CODE_REMOVE, CODE_KEEP, or NOT_CODE.

| Classification | Count |
|---------------|-------|
| Removed (this PR) | 41 |
| Kept (debug helpers, alt configs, reference impls) | 32 |
| Not code (false positives) | 16 |

Removed blocks include: superseded implementations, old test data,
abandoned stubs, unreachable code, and buggy dead code.
2026-04-10 15:18:02 +00:00
Hosang Yoon
4c0e73ab12 [rocm-libraries] ROCm/rocm-libraries#6156 (commit 367565a)
[CK_TILE] Optimize FMHA head-dim padded path on gfx11/gfx12
 (#6156)

## Motivation
On gfx11/gfx12, FMHA forward kernels that require head-dim padding show
a large performance drop compared to the exact-head-dim path. In
practice, padded cases such as `HDIM=72` and `HDIM=80` were falling too
far off the fast path.

This PR improves padded-head-dim FMHA performance on gfx11/gfx12 while
keeping the behavior for other GPUs unchanged.

## Technical Details

- Add/scope a dedicated padded-head-dim (`qr_hpad`) FMHA forward path
for gfx11/gfx12.
- For `receipt=0`, keep support conservative and only enable the padded
fast path for vector-safe cases (`head_dim % 8 == 0`), matching the
existing assumption used on other GPUs.
- Move `v_prefetch` later only for the head-dim-padded path on
gfx11/gfx12. This reduces live ranges and removes the register-spill
behavior seen in the earlier scheduling.
- Enable the buffer-load OOB check offset trick for the padded path on
gfx11/gfx12.

## Test Plan

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

## Test Result

Observed padded-head-dim performance improvements for HDIM=72/80:

- gfx11: about ~3.5x
- gfx1151: about ~2.0x
- gfx12: about ~1.3x

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-08 14:53:18 +00:00
Christopher Millette
a170e2bd9d [rocm-libraries] ROCm/rocm-libraries#5939 (commit 6fb1791)
[CK_TILE] Flatten nested static_for loops into static_ford
 (#5939)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary
Mechanical conversion of 129 nested `static_for`/`static_ford` patterns
to flat `static_ford` across 29 ck_tile header files.

Each conversion eliminates intermediate lambda closure instantiations by
replacing nested compile-time loops with a single flat iteration using
index decomposition.

### What `static_ford` eliminates

When `static_for` loops are nested, each level creates unique closure
types:
```cpp
// BEFORE: M + M×N = 20 IR functions (for M=4, N=4)
static_for<0, 4, 1>{}([&](auto m) {        // 4 closure instantiations
    static_for<0, 4, 1>{}([&](auto n) {     // 4×4 = 16 closure instantiations
        body(m, n);
    });
});

// AFTER: M×N = 16 IR functions (with ford_applier, no intermediates)
static_ford<sequence<4, 4>>{}([&](auto mn) {
    constexpr auto m = number<mn[number<0>{}]>{};
    constexpr auto n = number<mn[number<1>{}]>{};
    body(m, n);
});
```

### Pattern categories converted

| Category | Count | Description |
|----------|-------|-------------|
| C (2-level `static_for` chains) | 112 | Nested `static_for` →
`static_ford` |
| C3 (3-level `static_for` chains) | 9 | Three consecutive nests →
`static_ford` |
| Partial rescue | 3 | Outer 2 levels of blocked 4-level nests |
| B (nested `static_ford` merge) | 5 | Two nested `static_ford` → single
higher-dim `static_ford` |
| **Total** | **129** | Across 29 files |

6 false positives were detected and reverted (in `tensor_adaptor.hpp`,
`tile_distribution.hpp`, `tile_distribution_encoding.hpp`) where the
inner loop bound depended on the outer variable.

### Files changed by family

| Family | Files | Sites |
|--------|-------|-------|
| Block GEMM | 12 | ~20 |
| FlatMM pipelines | 4 | ~69 (including 5 ford-ford merges) |
| GEMM quant | 7 | ~22 |
| FlatMM kernel | 1 | 2 |
| FMHA | 1 | 2 |
| Reduce/norm | 2 | 2 |
| Epilogue | 1 | 1 |

### Blocked locations from review comments

- **block_gemm_areg_breg_creg_v1.hpp:356** — BLOCKED: runtime scale
loads (`scale_a_slice`, `scale_b_slice`, A warp tensor load) between
every nesting level
- **block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp:228** — BLOCKED:
`zero_accumulators()` before inner loop; `sched_barrier` + conditional
`block_sync_lds()` after inner loop
- **block_universal_gemm_as_aquant_bs_bquant_cr.hpp:298** — BLOCKED:
runtime `CWarpTensor` construction before inner loop; quantization scale
application code after inner loop
- **block_universal_gemm_as_aquant_bs_cr.hpp:277** — BLOCKED: same
pattern as above
- **block_universal_gemm_as_bs_bquant_cr.hpp:367** — BLOCKED: same
pattern as above

## Depends on
- #5938 ([CK_TILE] Optimize static_ford and sequence compile-time
infrastructure) — provides the `ford_applier` that makes these
conversions beneficial. Without it, `static_ford` uses a recursive
implementation that provides no IR function savings.

## Results (combined with #5938)

### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942)

| Target | Base (s) | Treat (s) | Delta | % | Significant? |
|--------|----------|-----------|-------|---|-------------|
| **flatmm** | 161.1 | 149.0 | **-12.1s** | **-7.5%** | **YES** (p<0.01,
7/7 wins) |
| **universal_gemm** | 225.4 | 220.3 | **-5.1s** | **-2.3%** | **YES**
(p<0.01, 7/7 wins) |

### IR Function Counts (device trace, gfx942)

| Target | InstFunc | CodeGen |
|--------|----------|---------|
| universal_gemm | **-8.5%** | **-9.2%** |
| flatmm | **-7.6%** | **-10.5%** |

### ASM Equivalence
5/5 PASS — 650,151 lines verified identical (gfx942). TUs:
universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale.

## Test plan
- [x] ASM equivalence verified (650K lines, gfx942)
- [x] Wilcoxon timing verified (7 trials, p<0.01)
- [x] IR function counts verified (-7.6% to -10.5% CodeGen reduction)
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-07 14:38:07 +00:00
Po Yen Chen
c2ac7aa7b0 [rocm-libraries] ROCm/rocm-libraries#6051 (commit f0838b2)
[CK] Add FP8 per-tensor quantization support for FMHA V3
 pipeline (#6051)

## Motivation

The existing FMHA V3 pipeline only supports fp16/bf16 data types. This
PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950,
enabling higher throughput for
  FP8 inference workloads using the assembly-optimized V3 code path.

  ## Technical Details

  **Warp GEMM:**
- Add FP8 32x32x32 warp gemm with C-transposed distribution
(`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries

  **V3 Kernel (`fmha_fwd_v3_kernel.hpp`):**
- Add per-tensor descale support for Q, K, V tensors, passing descale
pointers through to pipeline kargs

  **V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`):**
  - Add FP8 data path with dtype-aware type selection
  - Add asm volatile P matrix conversion from f32 to fp8
  - Add FP8-aware instruction scheduling in `CoreLoopScheduler`

**V3 Pipeline Policy
(`block_fmha_fwd_v3_pipeline_default_policy.hpp`):**
- Add FP8 QK warp gemm selection (SwizzleB variant for V tile
distribution compatibility)

  **Codegen (`fmha_fwd.py`):**
  - Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128)
- Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale:
no/pertensor)
  - Extend `can_dispatch_v3` condition for fp8bf16 + pertensor

  **Misc:**
- Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum
(`arch.hpp`)
- Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`)

V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`)
pending further validation.

## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950)

  | Problem | Regular (TFlops) | Varlen (TFlops) |
  |---|---:|---:|
  | batch=1 heads=6/1 seqlen=1024 causal | 48.9 | 47.6 |
  | batch=1 heads=6/1 seqlen=2048 causal | 119.8 | 117.4 |
  | batch=1 heads=6/1 seqlen=4096 causal | 263.7 | 259.2 |
  | batch=1 heads=6/1 seqlen=8192 causal | 548.9 | 543.6 |
  | batch=1 heads=6/1 seqlen=16384 causal | 1043.0 | 1063.7 |
  | batch=1 heads=6/1 seqlen=32768 causal | 1237.2 | 1279.6 |
  | batch=1 heads=6/1 seqlen=65536 causal | 1315.4 | 1382.7 |
  | batch=1 heads=6/1 seqlen=131072 causal | 1326.3 | 1402.2 |
  | batch=1 heads=16/1 seqlen=65536 causal | 1298.7 | 1388.4 |
  | batch=1 heads=40/40 seqlen=37200 non-causal | 1248.9 | 1326.1 |

## Test Plan

Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering
batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40),
GQA ratios (1:1, 1:8), and
causal/non-causal modes. Verified all cases dispatch to the V3 pipeline
by enabling `F_is_v3_enabled` and confirming kernel names contain
`qr_async_trload_v3`.

  ## Test Result

176/176 tests passed with V3 enabled. All cases correctly dispatched to
V3 pipeline with `pertensor` quantization.

  ## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-07 14:20:43 +00:00
Jeff Huang
020b6f435e [rocm-libraries] ROCm/rocm-libraries#6201 (commit 5c0697e)
[CK_TILLE] Temporarily remove batch prefill KV cache overflow
 asserts (#6201)

## Summary
- Temporarily remove the KV cache offset overflow assert checks in
`FmhaBatchPrefillWithPagedKVCacheKernel`
- The asserts are **correct**, but they block project progress in
certain configurations
- This is a **temporary workaround** to unblock progress; a proper fix
will follow

## Note
This is NOT a permanent solution. A follow-up PR will add proper
overflow handling that addresses the underlying issue without blocking
progress.
2026-04-07 12:42:08 +00:00
Linjun-AMD
08792e0b31 [rocm-libraries] ROCm/rocm-libraries#5504 (commit 47f86c7)
[CK Tile] Add sink token gradient support in FMHA backward
 pass (#5504)

## Motivation

Adds sink token support to the FMHA backward kernel (dot_do_o pipeline):

## Technical Details

- Extend BlockFmhaBwdOGradDotOPipelineProblem with LSEDataType
- Add sink_ptr/d_sink_ptr/lse_ptr/nhead to FmhaBwdOGradDotOCommonKargs
- Compute per-head sink gradient via atomic accumulation in the pipeline
- Update example runner with reference validation for sink gradient

## Test Plan

Add new test case

## Test Result

WIP

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-02 03:17:45 +00:00
Yi DING
791afc6465 [rocm-libraries] ROCm/rocm-libraries#5991 (commit 8d85e8e)
[CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR
 misallocation (#5991)

## Motivation

After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard
around the
`num_total_loop <= 0` early-exit check, the IGLP pipeline
(`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV
gradients for
non-masking kernels (even with fix in #5915). Assembly inspection
confirms that the CFG change causes the LLVM
register allocator to reuse AGPR accumulators as scratch destinations in
the dK/dV
reduction loop, breaking the loop-carried accumulation across Q-tile
iterations.

## Technical Details

- Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in
`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing:
it
restores the CFG shape that the register allocator needs to correctly
assign
  dedicated AGPRs to each column of the dK/dV accumulator.
- Only the IGLP pipeline is affected; the other two BWD pipelines do not
exhibit
  this issue.

## Test Plan

## Test Result

## Submission Checklist

- [x] Look over the contributing guidelines at

https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-01 05:45:19 +00:00
Yi DING
fb64a4453c [rocm-libraries] ROCm/rocm-libraries#5915 (commit a72cf7d)
[CK_TILE] Fix FMHA BWD register pressure by wrapping
 num_total_loop with amd_wave_read_first_lane (#5915)

## Motivation

In three FMHA backward pipelines, `num_total_loop` is computed without
`amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even
though it is logically uniform across all lanes. This raises register
pressure, and under high pressure the compiler may reuse VGPRs across
overlapping live ranges. This was confirmed via assembly inspection: the
compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an
intermediate value for dV, producing incorrect dK/dV gradients.

## Technical Details

Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three
pipelines:
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr`
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp`
- `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr`

This promotes `num_total_loop` to an SGPR, eliminating the excess
register pressure and the incorrect VGPR reuse.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-30 01:45:16 +00:00
Yi DING
47a04fda08 [rocm-libraries] ROCm/rocm-libraries#5790 (commit c132b5a)
[CK_TILE] Fix NaN for FMHA BWD When seq_q=0

## Motivation
This PR addresses NaNs in the FMHA backward (dQ/dK/dV) path when the
effective query sequence length for a tile is zero, by ensuring the
per-tile pipelines exit early with zeroed accumulators and by avoiding
an early kernel return that prevented writing out cleared gradients.

## Technical Details
- Add unconditional early-exit in the dK/dV pipelines when
`num_total_loop <= 0` (no work), returning zeroed accumulators.
- Adjust group-mode kernel early-return logic to only return when
**both** `seqlen_q` and `seqlen_k` are zero, allowing blocks to run and
store cleared dK/dV when `seqlen_q == 0`.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 07:54:53 +00:00
yinglu
d460ab35b6 [rocm-libraries] ROCm/rocm-libraries#4302 (commit e62bd8a)
[CK_TILE] add tf32 support
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in
CK_TILE on gfx942 and gfx950.

## Checklist

Please put an into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run  on all changed files
- [ ] Any dependent changes have been merged

## Discussion
2026-03-19 09:19:06 +00:00
Hosang
859acb5ae7 [rocm-libraries] ROCm/rocm-libraries#5018 (commit b32e7e6)
[CK_TILE] Add LLC-aware FMHA head grouping and head-major
 scheduling on RDNA (#5018)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation
Long-sequence FMHA can become memory-bound when K/V working sets exceed
Infinity Cache (LLC), causing repeated DRAM traffic across heads.

This PR introduces LLC-aware launch ordering improvements for FMHA
forward, and it is currently enabled only on gfx11 and gfx12. The
approach is inspired by
[`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217),
adapted to CK’s kernel/runner structure and layout handling.

In this context, `bshd` is the layout used in Flash-Attention, while
`bhsd` is the default layout used by the CK Tile FMHA example.

## Technical Details
This PR adds two complementary strategies:

- For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware
head grouping:
  - Estimate LLC size (env override, KFD sysfs, or arch default).
  - Compute group size from K/V bytes per head vs LLC target.
- Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and
related tensors).

- For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit
launch-order adjustment:
  - Keep a single kernel launch.
- Reinterpret block linearization in `GetTileIndex` to make execution
head-major,
     improving temporal locality of per-head K/V reuse.

Additional integration updates:
- Propagate `num_head_q_total` and `head_start` through FMHA args/kargs.
- Use global head indexing for dropout RNG stream mapping so grouped
launches keep
    deterministic/consistent dropout behavior.
- Keep fallback behavior unchanged when grouping is not beneficial or
disabled.

## Test Plan
- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`

## Test Result
- `test_ck_tile_fmha`: all tests passed.
- `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201,
and all of them show higher performance compared to the baseline. The
improvement is consistent, and performance is well maintained even at
long sequence lengths.

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128
-s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
- TFLOPs by sequence length target: gfx1100 layout: bhsd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 56.27 | 61.48 | 1.09x
4096 | 67.10 | 72.27 | 1.08x
8192 | 65.99 | 71.64 | 1.09x
12288 | 61.60 | 76.61 | 1.24x
16384 | 58.99 | 75.74 | 1.28x
20480 | 57.32 | 74.42 | 1.30x
24576 | 56.89 | 74.25 | 1.31x
27280 | 18.93 | 24.48 | 1.29x

- TFLOPs by sequence length target: gfx1201 layout: bshd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 66.79 | 65.90 | 0.99x
4096 | 85.90 | 86.80 | 1.01x
8192 | 77.06 | 90.29 | 1.17x
12288 | 58.36 | 88.98 | 1.52x
16384 | 52.12 | 88.88 | 1.71x
20480 | 48.11 | 88.42 | 1.84x
24576 | 47.12 | 89.07 | 1.89x
27280 | 49.05 | 50.31 | 1.03x

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 21:19:23 +00:00
Yi DING
574c1c121a [rocm-libraries] ROCm/rocm-libraries#5174 (commit a358a21)
[CK_TILE] FMHA BWD Use Persistent Kernels in Deterministic
 Mode (#5174)

## Motivation
This PR enables a persistent-kernel execution path for FMHA backward
(dQ/dK/dV) in deterministic mode, adjusting how dQ accumulation is
split, stored, and converted back to final gradients.

## Technical Details
- Introduces a persistent-kernel grid mapping in deterministic mode and
updates split-count calculation accordingly.
- Extends kernel kargs to carry batch-related info needed for persistent
scheduling and dQ conversion.
- Refactors dQ store conditions and adds mask-type traits/utilities and
runner logging updates.

## Test Plan
- Jenkins
[base](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/10/pipeline)
- Jenkins
[AITER](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/12/pipeline)
- Jenkins
[FMHA](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/11/pipeline)
- local FA tests

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-13 06:14:31 +00:00
Anton Gorenko
2312eef6c3 [rocm-libraries] ROCm/rocm-libraries#4368 (commit 17f7dfc)
[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on
 gfx950 (#4368)

## Motivation

Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline

## Technical Details

The microscaling is used when quant scale mode is
`BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are
fp8/bf8/fp4.

Supported features:
* only "qr" pipeline is implemented
* hdim 128 and 256 (smaller hdim are not possible due to restrictions of
"qr" pipeline, but they can be computed using instances with padding)
 * both 32x32x64 and 16x16x128 scale MFMAs are supported
 * Q and K scales are applied in hdim, V scales - in seqlen dimension
 * column-major V only
 * batch and group mode
 * bias, Alibi (tested but no instances by default, just like fp8)
 * masking etc.

Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008

## Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

## Test Result

The tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-11 10:00:52 +00:00
rocking
fe8b7d0c27 [rocm-libraries] ROCm/rocm-libraries#4742 (commit d340a14)
[CK_TILE] Fix FMHA async pipeline LDS sync issue

## Motivation

Fix FMHA forward async pipeline
(`block_fmha_pipeline_qr_ks_vs_async.hpp`) sync issue.
Some attention test cases intermittently fail due to a race condition
where the V tile store to LDS overwrites K tile data that is still being
read by other threads during the tail `gemm_0` operation.

## Technical Details

In the `BlockFmhaPipelineQRKSVSAsync` pipeline, K and V tiles share the
same LDS memory through a rotation schedule (`LdsSeq`).
After the tail `gemm_0` (line 458), some fast threads may proceed to
store V to LDS (line 617) before slow threads finish reading K data from
the same LDS buffer.

The fix adds an `s_barrier` synchronization after the tail `gemm_0` when
K's last sub-tile and V's first sub-tile use the same LDS buffer (i.e.,
`LdsSeq[k0_loops - 1] == LdsSeq[k0_loops]`):

`if constexpr(LdsSeq.at(number<k0_loops - 1>{}) ==
LdsSeq.at(number<k0_loops>{}))
    __builtin_amdgcn_s_barrier();`

Why `s_barrier` alone is sufficient (no s_waitcnt lgkmcnt(0) needed):
The `gemm_0` MFMA instruction internally waits for its LDS operands
(ds_read) to complete before execution
Therefore, each thread's ds_read of K data is already complete by the
time gemm_0 finishes
Only cross-thread synchronization (`s_barrier`) is needed to ensure all
threads have finished reading before any thread starts writing V
2026-03-09 18:06:54 +00:00
Jeff Huang
6e558658ea [rocm-libraries] ROCm/rocm-libraries#4999 (commit 45f6624)
[CK] Fix 32-bit overflow in batch prefill kernel for >4GB KV
 cache (#4999)

Use SRD rebasing for page_block_size >= kN0: move SRD base pointer to
page start via 48-bit arithmetic, encode only within-page offset in
voffset. Original code path preserved for ps1/ps16 via constexpr-if.

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-05 01:09:12 +00:00
Yi DING
b09112bbad [rocm-libraries] ROCm/rocm-libraries#4577 (commit a36922c)
[CK_TILE] FMHA BWD Launcher Interface

## Motivation
Reduce memory usage; Be prepared to implement optimizations of reducing
nsplits in deterministic cases.

## Technical Details
This PR introduces a new launcher interface for the FMHA backward
operation, replacing direct function calls with a more structured
approach. The launcher encapsulates kernel dispatch logic and provides
access to computed metadata like the number of dQ acc splits.

**Changes:**
- Added `fmha_bwd_launcher` class that wraps kernel execution and
exposes `dq_acc_splits`
- Moved `fmha_bwd_traits` construction earlier in the execution flow to
support launcher initialization
- Refactored code generation to produce both legacy API and new launcher
constructor

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-04 01:21:07 +00:00
SamiAario-AMD
95dc496d30 [rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)
Cleanup and refactoring related to tile loading
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Cleanup and refactoring done while implementing mixed precision for
fp16/bf16 x fp8

Key changes:

- Renamed load_interleaved_pk_type.hpp to load_and_convert_tile.hpp and
refactored the API to use consistent naming conventions
- Updated load_tile_transpose functions to use output parameters instead
of return values for consistency
- Removed unused variable declarations and simplified type deduction
logic
- Define load_tile_with_elementwise to use tuple types explicitly for
clarity

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-03-02 12:21:44 +00:00
Linjun-AMD
78ae3835a6 [rocm-libraries] ROCm/rocm-libraries#4313 (commit 080ac66)
[CK] Fix gptoss sink

## Motivation

This PR removes conditional logic for handling infinity values in the
sink mechanism across multiple FMHA pipeline implementations, defaulting
sink_size to 0 and adding a constraint in the kernel selection logic.

## Technical Details

Changes:

Removed __builtin_isinf_sign(sink_v) checks and conditional
initialization of LSE accumulators across 7 pipeline files
Added default initialization (= 0) for sink_size in 4 argument structs
Added F_sink == "f" constraint to kernel compatibility checking

## Test Plan

Local test

## Test Result

passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-02 01:54:46 +00:00
Anton Gorenko
0d92fffedb [rocm-libraries] ROCm/rocm-libraries#4584 (commit 42efd1d)
[CK_TILE][FMHA] Support gfx11

## Motivation

Add support of gfx11 architectures (RDNA3) to FMHA.

## Technical Details

Distributions (matrix elements to lane registers mapping) of gfx11 WMMA
are completely different from distributions of gfx9 MFMA and gfx12 WMMA.
There are two cases in FMHA where this difference matters:
* usage of results (matrix C) of one GEMM as input (matrix A) of another
GEMM.
* random number generation for dropout (implementation for gfx9 MFMA,
gfx12 WMMA and host validation produce the same results).

Both cases are solved by a special remapping implemented using
`__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`.

Additional changes:
* FMHA tests are now build and run only for those types for which
instances exist (gfx11 supports only fp16 and bf16).
* Two fixes for uninitialized values (`mask.sink` and
`do_fp8_static_quant`): they may contain garbage resulting in incorrect
dispatching logic, sometimes tests report that there are no instance
available for current parameters.
* Small fix to remove expcnt(0) from s_waitcnt instruction on gfx11 when
they are not requested (i.e. every time), likely has no effect on
performance but makes disassembly a bit clearer.

## Test Plan

```
ninja test_ck_tile_fmha

bin/test_ck_tile_fmha_fwd_fp16
bin/test_ck_tile_fmha_fwd_bf16
bin/test_ck_tile_fmha_bwd_fp16
bin/test_ck_tile_fmha_bwd_bf16
```

## Test Result

All tests must pass (some tests may be skipped).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-21 01:15:57 +00:00
Jeff Huang
7b18f5fed2 [rocm-libraries] ROCm/rocm-libraries#4263 (commit f34aec2)
[CK] Add FP8 KV_BLOCKSCALE support for batch prefill
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implement per-page K/V quantization for paged attention:
  - Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
  - Use exp2 shift trick to eliminate explicit P scaling overhead
- Prefetch physical pages offset for KV cache, overlaps with
computations

## Proposed changes

Please describe the motivation behind the pull request, whether it
enables a new feature or fixes a bug. If there are associated pull
requests or issues, please link them to the pull request.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-04 23:26:20 +00:00
Jeff Huang
e3556fed04 Optimize batch prefill kernel performance for VECTORIZED_LAYOUT KV cache (#3657)
- Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather
- Add is_gather_dim() and get_gather_index() for multi-dim page lookup
- Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match
  GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1})
- Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread
- Add static_assert for RowMajor VLayout requirement in batch prefill

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-29 07:18:41 +08:00
ltqin
67f0b74ec6 Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)" (#3635)
This reverts commit de5a1d730d.

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-23 09:03:22 -08:00
Po Yen Chen
de5a1d730d Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)
This reverts commit dd0b4294af.
2026-01-22 21:21:19 -08:00
Linjun-AMD
0b13697a88 [CK_TILE][FMHA]Add new tile size for async (#3623)
* Revert "Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)"

This reverts commit 8f75869408.

* Add new tile_size for async pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-22 16:07:14 +08:00
ltqin
dd0b4294af Fp8 block scale quantization for fmha fwd (#3330)
* add block scale parameters to kernel

* add block scale to kernel

* add smoke test

* format

* Revert "format"

This reverts commit 356c3c9706.

* only format my code

* format py

* fix auto not allowd in function prototype

* change instance tttt to ttff

* fix structured binding issue

* change s_acc elementwise op

* async pipeline add block scale

* add quantation P using shift exp2

* precompute (m - shift) once per row

* change blk scale seqstrt ptr name

* fix some name

* fix for  deduction guide

* fix some comments

* add P scale to qr_ksvs_pipeline

* add comment to idx_identity

* change the method of calculating descale block index

* unify naming style: use block_scale_ as name prefix

* unify naming style

* update the CHANGELOG.md

* Add FP8 block scale quantization support for FMHA forward kernel

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-21 20:58:26 -08:00
Yi DING
fcc9372c00 [CK_TILE] Fix Int32 Overflow in Deterministic FMHA BWD (#3615) 2026-01-21 09:54:46 +08:00
Linjun-AMD
8f75869408 Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)
This reverts commit f3aafb9555.
2026-01-20 09:40:54 -08:00
Linjun-AMD
f3aafb9555 [CK_TILE][FMHA] Add new tile size for async (#3586)
* add new tile size for async

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix lse error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-19 15:22:33 -08:00
Jeff Huang
993d3e2f0e [FMHA] Enable page size 16 for batch prefill kernel (#3568)
* [FMHA] Enable page size 16 for batch prefill kernel

* Refactor batch prefill KV offset logic to simplify template arguments
- Remove redundant `kLog2PageSize` and `kIsVTileFitsInPage` from template args.
- Add static assert to forbid `page_size=1` with vectorized layout.
2026-01-15 22:11:44 +08:00
Linjun-AMD
717ed0b59f [CK_TILE][FMHA] Enable gpt-oss sink (#3490)
* Enable gptoss sink

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add gptoss sink test

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update CHANGELOG.md

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix test args error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update test_fmha_fwd.cpp

* update sink test

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Revert "update sink test"

This reverts commit 970b4f1686.

* update sink test

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update valid sink_v in splitkv pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update example_fmha_fwd.cpp

* fix lse error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix clangformat error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix aiter scale error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update block_fmha_pipeline_qr_ks_vs.hpp

* div scale_s for sink_value

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update fmha_fwd_runner.hpp

* update sink_value with bias

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Fix typo in dropout parameter in fmha_batch_prefill_kernel

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update example_fmha_fwd.cpp

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async_trload.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* optimized some code

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix splitkv error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update sink reference

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update fmha_fwd_runner.hpp

* Update smoke_test_fwd_sink.sh

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-14 21:32:06 +08:00
Thomas Ning
00c46785a8 Shuffle fix for gfx950 (#3491)
* solve compiler issue

* solve the gfx950 mfma shuffle regression

* refactor jenkinsfile to handle arch name better

* [CK TILE] set divisor to count of thread along k dimension

* fix the compiler error

* solve degradation

* Finish the multiplies fix

* fix the scales

* solve compilation error

* solve the composes

* solve the error of tile sweeper

* fix the test and example

* fix for gfx950

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
2026-01-13 09:21:29 -08:00
Jeff Huang
141f77aa12 [CK Tile] Fix FMHA LSE calculation and potential division by zero (#3326)
This commit addresses numerical stability issues in the BlockFmhaPipelineQRKSVS pipeline when bias has -inf masking values:
1. Explicitly handle the case where the accumulated exponential sum (l) is zero. In this case, the LSE is now correctly set to negative infinity, preventing log(0) errors.
2. Extend the zero-check protection in the normalization step to cover the ELEMENTWISE_BIAS case, preventing potential division by zero.
2026-01-13 13:52:26 +08:00
Jeff Huang
c9f112b026 [FMHA] Support page_size=1 (linear layout) in batch prefill pipeline (#3545)
- Enable page_size=1 support in batch prefill codegen (linear layout only).
- Implement per-token page lookup in `kv_offset_array_transform` for page_size=1 to handle 3D input tensors correctly.
- Relax `kPageBlockSize` alignment assertion for the page_size=1 case.
2026-01-13 12:04:43 +08:00
Jeff Huang
cc75a1dc5f [FMHA] Batch Prefill Support Improvements: Change KV Cache Layout & Large Page Size Support (#3442)
* add page_block_size parameter

* add is_sglang_layout to  parameters

* add kv_offset_array_transform to batch async for page size 16

* add kv_last_page_lens to kernel

* change kv layout to [num_total_pages, page_block_size, hdim]

* format

* - enable codegen of batch_prefill kernels
- create new problem struct BlockFmhaBatchPrefillPipelineProblem for
  batch prefill kernels
- generate different page sizes of batch prefill kernels (1, 16)

* 1. fix wrong calculation of page id in kv_offset_array_transform in gfx950
2. support page size 1024

* fix python format

* change kv cache layout to [num_blocks, num_kv_heads, head_size/x,
block_size, x] and [num_blocks, num_kv_heads, block_size/X, head_size, X]

* 1. Introduced `kVectorSize` in BlockFmhaBatchPrefillPipelineProblem instead of using hardcode values
2. Makes batch prefill kernel traits structures inherent from fmha fwd
   traits
3. Add some static check for Page size, vector size, hdim, ..., etc.

* [Refactor] Replace is_sglang_layout with Enums for KV cache configuration

Refactored `fmha_batch_prefill` to use `BlockAttentionKVCacheMemoryLayoutEnum` (VECTORIZED/LINEAR) and `BlockAttentionKVCacheLookupTableEnum` (SGLANG_1D/VLLM_2D) instead of a single
boolean.

**Changes:**
*   Added Enum definitions in `block_attention_kvcache_layout_enum.hpp`.
*   Updated Kernel, Pipeline, and Traits to template on these Enums.
*   Implemented `kv_offset_array_transform` logic based on `kKVMemoryLayout`.
*   Refactored `PageBlockTableKargs` to adapt to `kKVLookupTable`.
*   Updated CodeGen scripts to support new parameters.

This decouples memory layout from the paging mechanism, enabling flexible KV cache configurations.

* 1. remove batch prefill pipeline with sk_pad=false
2. correct some comments
3. add static assert to make sure v offsets is in same page within a tile.

* fix vgpr spill count

* remove unnecessary t2s functions

* add fp8 support for receipt 200 and 600 in fmha_bath_prefill.py

* support linear kv cache layout

* Remove block_table_ptr from fwd_batch_prefill_args. Instead, reuse
kv_page_indices as a pointer of the lookup table.

* 1. merge multiple transforms into single transform.
2. add static check to make sure vlayout is row-major.

* move FmhaFwdCommonKargs::seqlen_k_ptr to VllmPageTableKargs.

* update changelog

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2026-01-05 18:41:47 +08:00
Po Yen Chen
1c3151963b [CK_TILE][FMHA] Add FP8 support for batch_prefill kernel (#3425)
* Add fp8bf16 support for batch_prefill

* Fix wrong scale_s re-compute logic in batch_prefill

* Fix wrong scale_s re-compute logic in fmha fwd

* Fix batch_prefill codegen error

* Remove no-longer used GetName() function

* Add fp8 logits=True instances

* Update CHANGELOG.md
2025-12-24 10:34:06 +08:00
Po Yen Chen
bfac64953f [CK_TILE][FMHA] Add logits soft-capping support for FAv3 (WIP) (#3355)
* Let fmha_fwd_v3() compatible with fmha_fwd()

* Decouple get_fwd_blobs() and FmhaFwdKernel

* Decouple compatibility checks from get_fwd_blobs()

* Extract product feature checks out from get_fwd_blobs()

* Remove duplicated code in factories and redundant checks

* Remove FmhaFwdKernel<>::GetName()

* Let FmhaFwdApiPool support pipelines with different mask_impl

* Add tile setting for fmha fwd v3 pipeline

* Add fwd v3 instances to tile_example_fmha_fwd manually

* Remove unused function import

* Undo irrelevant changes

* Remove fwd v3 instances from tile_example_fmha_fwd

* Finish fmha fwd v3 kernel instance codegen

* Fix formatting

* Remove unused F_idx attribute

* Add is_generic_attention_mask<> traits

* Add constraints to the fmha fwd v3 pipeline

* Unify traits & problem used for fmha fwd v3

* Unify kernel launch code for fmha fwd v2 & v3

* Unify kernel template selection logic

* Use same kernel codegen template for both v2 & v3

* Rename api() property as render() method

* Allow specifying filter for fmha fwd api pool

* Allow specifying function name when rendering api pool items

* Separate fmha fwd v3 kernel dispatching logic from v2

* Remove lambda assignment

* Add simple v2/v3 dispatch logic

* Stop generating empty if-clauses

Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them.

* Use "".join() to concatenate fmha fwd api string content

* Add more feature checks for fmha fwd v3 pipeline

* Check features before dispatch to fmha_fwd_v3()

* Add more feature checks for fmha_fwd_v3()

* Add missing filter call

* Use Tuple to reserve the dtype orders

* Fix wrong pipeline matching logic

* Add fmha fwd v3 group mode instances

* Add functor_transform<>

* Add type constraints to make_tile_window()

* Remove fmha fwd v3 example

* Fix wrong product(aiter mha_fwd()) config

* Fix wrong fmha fwd v2/v3 selection logic

* Fix formatting

* Add comment to warning v3 kernel users

* Fix wrong codegen logics

* Remove unnecessary param

* Fix format

* Add logits soft-capping support for fmha fwd v3 pipeline (WIP)

* Add missing Kargs base type

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-12-18 16:08:45 +08:00
ltqin
92653168c2 flashattention fwd add (80, 96) instance (#3415)
* add hdim (96,96) instance

* change to (80,96)

* format py

* remove 96 in optdim

* when N=6 change to llvm_amdgcn_raw_buffer_load_i32x3
2025-12-17 09:16:11 -08:00
Linjun-AMD
f5573f56d9 Add attention sink support for FMHA FWD (#3368)
* Revert "Revert "Add attn sink (#2892)" (#3250)"

This reverts commit 5adaa201ed.

* fix conflict

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Add F_sink parameter to FmhaFwdPipeline

* Update tile_fmha_traits.hpp

* Refactor pipeline creation in fmha_fwd.py

Updated the pipeline creation logic to include 'sink' parameter in product combinations and adjusted the FmhaFwdPipeline calls accordingly.

* Update fmha_fwd.py

* Update fmha_fwd.py

* Update example/ck_tile/01_fmha/script/correct_test_fwd_sink.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update CHANGELOG.md

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update CHANGELOG with new features and support

* Update fmha_fwd.hpp

* Update CHANGELOG.md

* Update smoke_test_fwd_sink.sh

* Update correct_test_fwd_sink.sh

* Update smoke_test_fwd_sink.sh

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-12-15 12:21:59 +08:00
Anton Gorenko
ca6143f0b2 Add a workaround for a compiler issue for bwd on gfx90a and ROCm 7.1.1 (#3369)
Sometimes there are not enough wait-states between v_mfma_f32... and v_accvgpr_read_b32 instructions if they are separated by s_cbranch.
The workaround is to read accvgprs to vgpr before branching.
2025-12-08 07:44:17 -08:00
Po Yen Chen
05292b3604 [CK_TILE][FMHA] Integrate FAv2 & FAv3 (WIP) in the single fmha_fwd() API (#3153)
* Let fmha_fwd_v3() compatible with fmha_fwd()

* Decouple get_fwd_blobs() and FmhaFwdKernel

* Decouple compatibility checks from get_fwd_blobs()

* Extract product feature checks out from get_fwd_blobs()

* Remove duplicated code in factories and redundant checks

* Remove FmhaFwdKernel<>::GetName()

* Let FmhaFwdApiPool support pipelines with different mask_impl

* Add tile setting for fmha fwd v3 pipeline

* Add fwd v3 instances to tile_example_fmha_fwd manually

* Remove unused function import

* Undo irrelevant changes

* Remove fwd v3 instances from tile_example_fmha_fwd

* Finish fmha fwd v3 kernel instance codegen

* Fix formatting

* Remove unused F_idx attribute

* Add is_generic_attention_mask<> traits

* Add constraints to the fmha fwd v3 pipeline

* Unify traits & problem used for fmha fwd v3

* Unify kernel launch code for fmha fwd v2 & v3

* Unify kernel template selection logic

* Use same kernel codegen template for both v2 & v3

* Rename api() property as render() method

* Allow specifying filter for fmha fwd api pool

* Allow specifying function name when rendering api pool items

* Separate fmha fwd v3 kernel dispatching logic from v2

* Remove lambda assignment

* Add simple v2/v3 dispatch logic

* Stop generating empty if-clauses

Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them.

* Use "".join() to concatenate fmha fwd api string content

* Add more feature checks for fmha fwd v3 pipeline

* Check features before dispatch to fmha_fwd_v3()

* Add more feature checks for fmha_fwd_v3()

* Add missing filter call

* Use Tuple to reserve the dtype orders

* Fix wrong pipeline matching logic

* Add fmha fwd v3 group mode instances

* Add functor_transform<>

* Add type constraints to make_tile_window()

* Remove fmha fwd v3 example

* Fix wrong product(aiter mha_fwd()) config

* Fix wrong fmha fwd v2/v3 selection logic

* Fix formatting

* Add comment to warning v3 kernel users

* Fix wrong codegen logics

* Remove unnecessary param

* Fix format

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-12-05 10:31:12 +08:00
rocking
eb7f617713 fp8 fmha async pipeline (#3339)
* replace qr with async pipeline

* Add fp8fp32 to DTYPE_BITS

* Add kAlignmentRandVal to avoid compile fail

* format

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-12-04 12:18:25 +08:00
Aviral Goel
de6466481f chore(copyright): update copyright header for include directory (#3293) 2025-11-26 11:00:05 -07:00
rocking
229d43ea0c Fix batch prefill compile fail in aiter (#3279)
* Fix batch prefill aiter compile fail

* Fix compile error
2025-11-25 09:46:32 +08:00