mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-18 03:49:41 +00:00
ed0912fc0a8508a6ebc72c3b5ffefb72a1452cf8
132 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
ed0912fc0a |
[CK_TILE] Enable V3 persistent kernel dispatch for FMHA forward on gfx950 (#6529)
## Motivation
Enable the existing V3 persistent kernel path for CK-Tile FMHA forward on
gfx950 (MI350X/MI355X). The V3 kernel and codegen infrastructure already
exist but are disabled via hardcoded `F_is_v3_enabled=False`.
This change replaces the compile-time gate with a runtime environment variable,
`CK_FMHA_ENABLE_V3=1` (disabled by default, opt-in). When enabled:
- **Prefill** workloads (seqlen_q > 1) dispatch to the V3 persistent pipeline
- **Decode** workloads (seqlen_q == 1) always use V2 (memory-bound, better suited)
The V3 persistent kernel uses grid-stride scheduling, XCD-interleave tile
assignment for L2 locality, LPT reversal for causal masks, and gfx950 async
buffer loads.
## Technical Details
Single file: `example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py`
- Add `#include <cstdlib>` and `<string>` for `std::getenv`
- Replace `{F_is_v3_enabled}` template parameter with runtime env var check
- Add `seqlen_q > 1` guard (decode always uses V2)
- Remove `.format()` call in `write_fwd_api()`
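The gate described above can be sketched as two small helpers. This is a minimal illustration, not the generated dispatch code: the function names are hypothetical, and the real dispatcher presumably checks further conditions (architecture, data type, tile shape) that are not modeled here.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: true only when CK_FMHA_ENABLE_V3 is set to exactly "1".
// An unset variable or any other value keeps the default (V2) behavior.
inline bool v3_enabled_from_env()
{
    const char* value = std::getenv("CK_FMHA_ENABLE_V3");
    return value != nullptr && std::string(value) == "1";
}

// Hypothetical helper: V3 is chosen only for prefill (seqlen_q > 1).
// Decode (seqlen_q == 1) stays on V2 regardless of the environment variable.
inline bool should_dispatch_v3(bool env_enabled, int seqlen_q)
{
    return env_enabled && seqlen_q > 1;
}
```

Keeping the environment lookup separate from the decision makes the decision itself easy to test without touching process state.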
## Dependencies
Depends on https://github.com/ROCm/rocm-libraries/pull/6501 — builds on
XCD-interleave and LPT scheduling infrastructure.
## Test Plan
- GPU validation on MI300X (gfx942, ROCm 6.4.1):
  - Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3`
- GPU validation on MI350X (gfx950, ROCm 7.0):
  - Command (V2): `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3`
  - Command (V3): `CK_FMHA_ENABLE_V3=1 ./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3`
  - Command (decode, always V2): `./build/bin/tile_example_fmha_fwd -b=64 -h=32 -h_k=8 -s=1 -s_k=4096 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3`
## Test Result
Benchmark results (MI350X, gfx950, ROCm 7.0):
| Config | V2 (TFlops) | V3 (TFlops) | Speedup |
|--------|-------------|-------------|---------|
| Non-causal b=2 h=8 hk=2 s=4096 d=128 bf16 | 696.3 | 884.2 | **+27.0%** |
| Causal b=2 h=8 hk=2 s=4096 d=128 bf16 | 371.3 | 494.9 | **+33.3%** |
| GQA b=2 h=32 hk=8 s=2048 d=128 bf16 | 671.3 | 831.7 | **+23.9%** |
| LLaMA-70B b=1 h=64 hk=8 s=4096 d=128 bf16 | 761.5 | 927.3 | **+21.8%** |
| Causal GQA b=2 h=32 hk=8 s=2048 d=128 bf16 | 345.4 | 631.9 | **+82.9%** |
| Long-seq b=1 h=16 s=16384 d=128 bf16 | 797.8 | 969.9 | **+21.6%** |
| Decode b=64 h=32 hk=8 s=1 s_k=4096 bf16 | 1828 GB/s | — (V2 path) | unaffected |
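The Speedup column is consistent with the relative throughput gain `(V3 / V2 - 1) * 100`. The helper below is a hypothetical sketch of that arithmetic, assuming this is how the percentages were derived; it can be used to re-check any row.

```cpp
#include <cassert>
#include <cmath>

// Relative throughput gain of V3 over V2, in percent.
inline double speedup_percent(double v2_tflops, double v3_tflops)
{
    return (v3_tflops / v2_tflops - 1.0) * 100.0;
}
```

For the non-causal row, `speedup_percent(696.3, 884.2)` is within rounding of the reported +27.0%.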
Benchmark results (MI300X, gfx942, ROCm 6.4.1):
V3 has no effect on MI300X: it relies on gfx950 async buffer loads and
falls back to the V2 code path on gfx942. No regression on any config.
| Config | TFlops / GB/s | Time (ms) | Delta vs baseline |
|--------|-------------|-----------|-------------------|
| MHA bf16 b=2 h=8 s=4096 d=128 | 342.98 TFlops | 0.401 | +0.1% |
| MHA fp16 b=2 h=8 s=4096 d=128 | 411.18 TFlops | 0.334 | +4.9% |
| Causal MHA bf16 b=2 h=8 s=4096 d=128 | 232.61 TFlops | 0.296 | +2.4% |
| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 | 320.07 TFlops | 0.429 | -1.4% |
| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 | 353.91 TFlops | 0.777 | +1.7% |
| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 | 381.53 TFlops | 1.441 | +1.2% |
| Long-seq bf16 b=1 h=16 s=16384 d=128 | 388.61 TFlops | 5.659 | +1.4% |
| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 | 693.40 GB/s | 1.550 | +0.3% |
All validation tests pass (`valid:y`) on both MI300X and MI350X.
Additional validation:
- `CK_FMHA_ENABLE_V3=0` correctly falls back to V2 (default behavior unchanged)
- `CK_FMHA_ENABLE_V3=1` dispatches to V3 for prefill, V2 for decode
- Validation passes across fp16/bf16, batch/group mode, causal/non-causal
- No regression on decode path
---------
Co-authored-by: Chao Zhou <chaozhou@fb.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
|
||
|
|
1cf336d87a |
[CK_TILE][FMHA] Fix sink un-mask under right-window and emit fp8bf16 batch_prefill sink kernels (#6914)
## Summary
Two related fixes to `ck_tile` FMHA so that StreamLLM-sink + sliding-window
batch-prefill works correctly for fp8 KV / bf16 compute.
Review the commits in this order:
1. `fmha: emit sink kernels for fp8bf16 batch_prefill`
   Extends `example/ck_tile/01_fmha/codegen/ops/fmha_batch_prefill.py` so
   the fp8(KV) / bf16(QO) batch-prefill codegen also emits the
   `mask=mask_enum::generic_with_sink` variant. Without this the runtime
   could not dispatch to a sink-aware kernel for the fp8bf16 path.
2. `fmha: respect right-window in IsOutOfSinkBound`
   The sink un-mask in `GenericAttentionMask::IsOutOfSinkBound` (local-mask
   branch) used `(i_y + x) > 1` as the gate, which conditioned on the row
   index instead of the column index. As a result, queries `1..sink-1`
   could attend to *future* sink positions (violating causal / right-window),
   while query `0` fell back to the plain causal mask. The fix replaces the
   guard with `i_x < i_y + x` so every query only sees sink columns up to
   its own right-window boundary.
3. `fmha: clarify IsOutOfSinkBound predicate comment`
   Doc-only follow-up that rewrites the comment above the predicate as a
   clause-by-clause explanation (`i_x < sink`, `i_x < i_y + x`,
   `y < y_total`, `i_y < x_total`).
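The difference between the old and new gate can be seen in a simplified model. This sketch is not the body of `IsOutOfSinkBound`: the function names are hypothetical, the `y < y_total` and `i_y < x_total` clauses are omitted, and `x = 1` is assumed to represent a plain causal right window.

```cpp
#include <cassert>

// i_y  : query (row) index
// i_x  : key (column) index
// sink : number of sink tokens at the start of the sequence
// x    : right-window offset (x = 1 models a plain causal mask in this sketch)

// Gate before the fix: the second clause depends on the row only,
// so any row with i_y + x > 1 un-masks every sink column.
inline bool sink_visible_old(int i_y, int i_x, int sink, int x)
{
    return i_x < sink && (i_y + x) > 1;
}

// Gate after the fix: the sink column must also lie inside the
// right-window boundary of the current row.
inline bool sink_visible_new(int i_y, int i_x, int sink, int x)
{
    return i_x < sink && i_x < i_y + x;
}
```

With `sink = 4` and `x = 1`, the old gate lets query 1 see sink column 3 (a future position) and rejects query 0 entirely, while the new gate limits each query to sink columns at or before its own position.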
## Test plan
- [x] Repro on aiter `op_tests/test_batch_prefill.py` (fp8 + bf16_dequant
modes with `sink=4`, `win_left=1023`, `softcap=0.0`, `sal=True`)
now passes for all parametrized shapes.
- [x] Existing fp16/bf16 batch-prefill paths (no sink) unchanged: the codegen
diff only adds the `generic_with_sink` variant for fp8bf16; existing
kernel object lists unaffected.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: fengjunda.aml <fengjunda.aml@bytedance.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: root <root@smci350-rck-g03-f12-31.rck.dcgpu>
|
||
|
|
0398b864c3 |
[CK_TILE] Use Unified Workspace for FMHA BWD (#6152)
## Motivation
`dq_acc` is the intermediate accumulation buffer used in FMHA backward
pass for deterministic mode. The current implementation allocates it as
a **single rectangular tensor**:
```
shape = [shape_batch, nhead, nsplits, shape_seqlen_q, hdim_q]
```
where `nsplits = launcher.dq_acc_splits` (a single scalar), computed
from `max_seqlen_k` and shared across all batches.
### Problems
1. **Memory waste**: In group mode, each batch may have a different
`seqlen_k`, but `nsplits` is computed from `max_seqlen_k`, causing
batches with shorter `seqlen_k` to over-allocate in the split dimension.
2. **Interface coupling**: `fmha_bwd_args` exposes internal layout
details such as `stride_dq_acc`, `nhead_stride_dq_acc`,
`batch_stride_dq_acc`, and `split_stride_dq_acc`. The caller is
responsible for computing these strides, but this logic belongs inside
the kernel.
### Goals
1. Switch `dq_acc` buffer to a **compact layout**: batches are
concatenated contiguously, with each batch occupying `nhead * nsplits_i
* seqq_i * hdim_q` elements (nhead outermost).
2. **Remove all `*_stride_dq_acc` fields** from `fmha_bwd_args`,
replacing them with a single `workspace_ptr`; the kernel splits this
internally using a fixed layout.
3. `fmha_bwd_launcher` provides a **workspace management interface**:
the caller only needs to allocate GPU memory and call
`prepare_workspace()`; no layout computation is required.
4. **Isolate kernel internals from the caller API**: the `dq_acc` layout
(nsplits, strides, buffer size) is determined entirely inside the
launcher/kernel. Future changes to block shape, pipeline type, or
persistent kernel strategy require no modifications to the caller's
`fmha_bwd_args` or workspace allocation logic.
## Technical Details
### Interface Design
#### New fields in `fmha_bwd_traits`
```cpp
struct fmha_bwd_traits
{
int seqlen_q;
int seqlen_k;
int batch;
int max_seqlen_q;
int max_seqlen_k;
int hdim_q;
int hdim_v;
int nhead_q;
int nhead_k;
std::string data_type;
bool is_group_mode;
mask_enum mask_type;
bias_enum bias_type;
bool has_dbias;
bool has_dropout;
bool is_store_randval;
bool is_deterministic;
// New: cumulative physical seqlen pointers for group mode (pass nullptr for batch mode).
// seqstart_qs[i+1] - seqstart_qs[i] = physical seqlen_q of batch i (including padding); length = batch+1
// seqstart_ks[i+1] - seqstart_ks[i] = physical seqlen_k of batch i (including padding); length = batch+1
const int* seqstart_qs = nullptr;
const int* seqstart_ks = nullptr;
};
```
#### `fmha_bwd_launcher` actual structure
```cpp
struct fmha_bwd_launcher
{
std::function<float(fmha_bwd_args, const ck_tile::stream_config&)> run{};
// Total workspace size in bytes (host_ws_size + device_ws_size), computed by init().
// Zero for kUseQrQtrDorPipeline (writes dq directly, no acc buffer needed).
size_t workspace_size = 0;
fmha_bwd_launcher(const fmha_bwd_traits&);
// Copies auxiliary data (nsplits[], offsets[]) via hipMemcpy to the head of the GPU workspace,
// and zeros the dq_acc buffer portion (tail of workspace) if required.
// The memory pointed to by device_ws must be >= workspace_size bytes.
std::function<void(void* device_ws)> prepare_workspace{};
template <typename... Args>
float operator()(Args&&... args) const { return run(std::forward<Args>(args)...); }
private:
size_t host_ws_size = 0; // CPU workspace size (nsplits[] + offsets[] arrays)
size_t device_ws_size = 0; // GPU-only data size (dq_acc buffer)
std::unique_ptr<char[]> ws_host; // host-side workspace buffer
public:
template <typename T0, typename T1, typename T2, typename Arch>
void init(const fmha_bwd_traits& traits);
};
```
The `init<>()` template method (invoked by codegen dispatch branches as
`this->init<...>(t)`) is responsible for:
1. Setting the `run` lambda
2. Calling `FmhaBwdDQDKDVKernel::GetWorkspaceHostSize(batch)` to obtain
`host_ws_size`
3. Allocating `ws_host` (host memory)
4. Calling `FmhaBwdDQDKDVKernel::PrepareWorkspaceHost(ws_host.get(),
...)` to fill nsplits/offsets; return value is `device_ws_size`
5. `workspace_size = host_ws_size + device_ws_size`
6. Setting the `prepare_workspace` lambda (captures `this`, calls
`PrepareWorkspaceDevice`)
When no kernel matches the given traits, both `run` and
`prepare_workspace` are initialized to default lambdas that print a
warning to `std::cerr` and return gracefully (no exception).
#### Workspace overall layout
The workspace is managed by `FmhaBwdWorkspaceManager` and consists of
two segments:
```
Offset 0 (CPU-prepared segment, host_ws_size bytes; also hipMemcpy'd to the head of GPU workspace):
index_t nsplits[batch or 1] — per-batch nsplits array
group mode: batch elements
batch mode / non-deterministic: 1 element
[group mode only] long_index_t dq_acc_offsets[batch+1]
— per-batch element offset (exclusive prefix sum; last entry holds the total)
offsets[0]=0, offsets[i+1] = offsets[i] + nhead*nsplits_i*seqq_i*hdim_q
Offset host_ws_size (device data segment, device_ws_size bytes):
AccDataType dq_acc[total_elements] — compact dq_acc buffer (zeroed if required)
total_elements = sum_i(nhead * nsplits_i * seqq_i * hdim_q)
layout within each batch: [nhead, nsplits_i, seqq_i, hdim_q]
note: seqq_i uses the physical length (including padding)
```
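The offsets recurrence in the layout above can be sketched on the host as follows. This is an illustration under assumptions, not the `PrepareWorkspaceHost` implementation: the function name is hypothetical, and `index_t` / `long_index_t` are assumed to be 32-bit and 64-bit integers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds dq_acc_offsets[batch + 1], measured in elements:
//   offsets[0]     = 0
//   offsets[i + 1] = offsets[i] + nhead * nsplits[i] * seqq[i] * hdim_q
// seqq[i] is the physical seqlen_q of batch i (including padding).
inline std::vector<std::int64_t> make_dq_acc_offsets(const std::vector<std::int32_t>& nsplits,
                                                     const std::vector<std::int32_t>& seqq,
                                                     std::int32_t nhead,
                                                     std::int32_t hdim_q)
{
    assert(nsplits.size() == seqq.size());
    std::vector<std::int64_t> offsets(nsplits.size() + 1, 0);
    for(std::size_t i = 0; i < nsplits.size(); ++i)
    {
        // Widen before multiplying so large shapes do not overflow 32 bits.
        const std::int64_t batch_elems =
            static_cast<std::int64_t>(nhead) * nsplits[i] * seqq[i] * hdim_q;
        offsets[i + 1] = offsets[i] + batch_elems;
    }
    return offsets;
}
```

The last entry equals `total_elements`, so the same array gives both the per-batch start offsets and the size of the `dq_acc` buffer.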
Alignment constant (`ALIGNMENT = 16`):
```
nsplits_size = align_up(sizeof(index_t) * N, 16) // N = batch (group) or 1 (batch/non-det)
offsets_size = align_up(sizeof(long_index_t) * (batch+1), 16) // group mode only
host_ws_size = nsplits_size + offsets_size
dq_acc_offset = host_ws_size // GetDqAccDataOffset(batch)
```
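A small host-side sketch of the size formulas above. The function names are hypothetical, and `index_t` / `long_index_t` are assumed to be 32-bit and 64-bit; the caller passes the nsplits element count `N` chosen by the rule in the comment above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kAlignment = 16;

// Rounds bytes up to the next multiple of alignment.
constexpr std::size_t align_up(std::size_t bytes, std::size_t alignment)
{
    return (bytes + alignment - 1) / alignment * alignment;
}

// host_ws_size = nsplits_size + offsets_size, where the offsets array
// exists only in group mode and has batch + 1 entries.
inline std::size_t host_ws_size(std::size_t nsplits_count, std::size_t batch, bool is_group_mode)
{
    const std::size_t nsplits_size = align_up(sizeof(std::int32_t) * nsplits_count, kAlignment);
    const std::size_t offsets_size =
        is_group_mode ? align_up(sizeof(std::int64_t) * (batch + 1), kAlignment) : 0;
    return nsplits_size + offsets_size;
}
```

For example, group mode with `batch = 3` gives 16 bytes of nsplits plus 32 bytes of offsets, so the `dq_acc` data would start at byte 48 under these assumptions.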
**Key benefits**:
- The kernel reads nsplits/offsets directly from the workspace head — no
device-side recomputation.
- `FmhaBwdConvertQGradKernel` is completely decoupled from the pipeline
block shape (`kN0`): nsplits is read from `nsplits_ptr`, `kN0` is no
longer a template parameter, and multiple dq_dk_dv tiles with different
`F_bn0` values now share a single convert_dq kernel instance (under
receipt 1/2, deterministic convert_dq kernel count drops from ~300 to
60).
- nsplits/offsets are computed on the host and transferred in one
`hipMemcpy`; the dq_acc buffer follows immediately, at the offset given
by `GetDqAccDataOffset`.
#### Workspace size by scenario
| Scenario | `workspace_size` | Notes |
|----------|-----------------|-------|
| **kUseQrQtrDorPipeline** (any mode) | `0` | Writes dq directly; no acc buffer; `PrepareWorkspaceHost` returns 0 |
| **Non-deterministic + batch mode** | `> 0` | nsplits[1]=1; dq_acc used for atomic add; `workspace_size = host_ws_size + batch*nhead*seqlen_q*hdim_q*ebytes` |
| **Non-deterministic + group mode** | `> 0` | nsplits[1]=1; dq_acc contiguous layout; `workspace_size = host_ws_size + nhead*seqstart_qs[batch]*hdim_q*ebytes` |
| **Deterministic + group mode** | `> 0` | nsplits[batch], offsets[batch+1], compact dq_acc; nsplits_i computed independently per batch |
| **Deterministic + batch mode persistent** | `> 0` | nsplits[1] (uniform across batches); dq_acc `batch*nhead*nsplits*seqlen_q*hdim_q` |
**NeedsZeroDqAcc** (determines whether `PrepareWorkspaceDevice` calls
`hipMemset`):
- Persistent kernel (deterministic batch mode) or non-deterministic:
**must zero** (atomic add requires zero initialization)
- Deterministic group mode + no mask: **no zeroing needed** (every tile
writes its full region)
- Deterministic + with mask: **must zero** (some blocks are skipped,
leaving uninitialized tiles that would contribute to the reduction)
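The three rules above can be restated as one predicate. This is a hypothetical restatement for readability, not the actual `NeedsZeroDqAcc` code; it assumes a workspace exists at all, so the `kUseQrQtrDorPipeline` case (no `dq_acc` buffer) is out of scope.

```cpp
#include <cassert>

// Returns true when the dq_acc portion of the workspace must be zeroed
// before the kernel runs.
inline bool needs_zero_dq_acc(bool is_deterministic, bool is_group_mode, bool has_mask)
{
    if(!is_deterministic)
        return true; // atomic add accumulates into dq_acc
    if(!is_group_mode)
        return true; // persistent kernel (deterministic batch mode)
    // Deterministic group mode: zero only when masked blocks may be skipped
    // and would otherwise leave uninitialized tiles in the reduction.
    return has_mask;
}
```

Only deterministic group mode without a mask skips the memset, which matches the case where every tile writes its full region.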
#### Caller usage
```cpp
// 1. Create launcher (traits include seqstart_qs/ks pointers; workspace_size is computed during construction)
fmha_bwd_launcher launcher(fmha_traits);
// 2. Read launcher.workspace_size directly
const auto ws_size = launcher.workspace_size;
// 3. Allocate a single GPU workspace
ck_tile::DeviceMem ws_buf(ws_size);
// 4. Copy nsplits/offsets to GPU head and zero dq_acc if required
launcher.prepare_workspace(ws_buf.GetDeviceBuffer());
// 5. Build args with a single workspace pointer; the kernel splits it internally
fmha_bwd_args args{
...,
ws_size > 0 ? ws_buf.GetDeviceBuffer() : nullptr, // workspace_ptr
};
launcher(args, stream_config);
```
---
### Key Code Structure
#### FmhaBwdWorkspaceManager (`fmha_bwd_kernel.hpp`, new class)
```cpp
template <typename AccDataType, bool kIsGroupMode, bool kIsDeterministic>
struct FmhaBwdWorkspaceManager
{
static constexpr size_t ALIGNMENT = 16;
// CPU workspace (nsplits + offsets) sizes
static size_t GetDqAccSplitsSize(int batch); // align_up(sizeof(index_t)*N, 16)
static size_t GetDqAccOffsetsSize(int batch); // group mode only: align_up(sizeof(long_index_t)*(batch+1), 16)
static size_t GetWorkspaceHostSize(int batch); // = SplitsSize + OffsetsSize
// Starting offset of dq_acc data within the full workspace (= host_ws_size)
static size_t GetDqAccDataOffset(int batch); // = GetWorkspaceHostSize(batch)
// Fills nsplits/offsets in the CPU workspace; returns device_ws_size (dq_acc buffer bytes)
template <bool kUseQrQtrDorPipeline, index_t kN0>
static size_t PrepareWorkspaceHost(void* cpu_ws, index_t batch_size, index_t hdim_q,
index_t nhead_q, index_t seqlen_q, index_t seqlen_k,
const index_t* seqstart_qs, const index_t* seqstart_ks);
// hipMemcpy's cpu_ws to device_ws head; hipMemset's the dq_acc portion to 0 if required
template <bool kUseQrQtrDorPipeline, bool kHasMask>
static void PrepareWorkspaceDevice(void* device_ws, const void* host_ws,
size_t device_ws_size, size_t host_ws_size);
};
```
#### workspace_ptr parsing (inside the kernel)
The kernel parses three address regions from `kargs.workspace_ptr`:
**Group mode (`FmhaBwdDQDKDVKernel::MakeKargs`)**:
```cpp
const uint8_t* ws = reinterpret_cast<uint8_t*>(workspace_ptr);
// dq_acc_ptr (stored in FmhaBwdCommonKargs)
ws + WorkspaceManager::GetDqAccDataOffset(batch)
// dq_acc_batch_offset_ptr (FmhaBwdGroupModeKargs field)
reinterpret_cast<const long_index_t*>(ws + WorkspaceManager::GetDqAccOffsetsOffset(batch))
```
**Batch mode**:
```cpp
ws + WorkspaceManager::GetDqAccDataOffset(batch) // dq_acc_ptr
// No offsets pointer; batch offset is computed inside run_() from nsplits
```
**`FmhaBwdConvertQGradKernel`** follows the same pattern:
- Group mode: extracts `dq_acc_ptr`, `dq_acc_batch_offset_ptr`, and
`nsplits_ptr` (`GetDqAccSplitsOffset(batch)`) from workspace
- Batch mode: reads nsplits from `nsplits_ptr[0]`; batch offset computed
internally
### Addressing in `run_()` (group mode)
```cpp
// Per-batch processing:
const long_index_t batch_offset_dq_acc = kargs.dq_acc_batch_offset_ptr[i_batch];
// seqq_i (physical length) derived from seqstart_q_ptr
const index_t seqq_i = kargs.seqstart_q_ptr[i_batch+1] - kargs.seqstart_q_ptr[i_batch];
// nsplits_i read from nsplits_ptr (convert_dq kernel) or from GetDqAccSplits
const long_index_t split_stride_i = static_cast<long_index_t>(seqq_i) * kargs.hdim_q;
const long_index_t nhead_stride_i = static_cast<long_index_t>(nsplits_i) * split_stride_i;
// Final address:
dq_acc_base + batch_offset_dq_acc + i_nhead * nhead_stride_i + i_split * split_stride_i
```
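The address computation above can be exercised as a standalone function. The name and signature are hypothetical, and the result is an element offset relative to `dq_acc_base` rather than a pointer.

```cpp
#include <cassert>
#include <cstdint>

// Element offset of the (i_nhead, i_split) slab of one batch inside the
// compact dq_acc buffer, using the per-batch layout [nhead, nsplits_i, seqq_i, hdim_q].
inline std::int64_t dq_acc_slab_offset(std::int64_t batch_offset,
                                       std::int32_t seqq_i,
                                       std::int32_t nsplits_i,
                                       std::int32_t hdim_q,
                                       std::int32_t i_nhead,
                                       std::int32_t i_split)
{
    const std::int64_t split_stride = static_cast<std::int64_t>(seqq_i) * hdim_q;
    const std::int64_t nhead_stride = static_cast<std::int64_t>(nsplits_i) * split_stride;
    return batch_offset + i_nhead * nhead_stride + i_split * split_stride;
}
```

Because both strides depend on `seqq_i` and `nsplits_i`, they differ per batch, which is why the caller can no longer supply fixed `*_stride_dq_acc` values.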
#### nsplits computation (`PrepareWorkspaceHost`)
`PrepareWorkspaceHost` is a template method of `FmhaBwdWorkspaceManager`
that still takes `kN0` as a template parameter (from
`BlockFmhaShape::kN0` of the dq_dk_dv pipeline). However, this parameter
is **only used inside this host-side function** to compute nsplits — it
is no longer passed into the convert_dq kernel.
| Mode | nsplits computation |
|------|---------------------|
| kUseQrQtrDorPipeline | Writes dq directly; nsplits[0]=0; returns device_ws_size=0 |
| Non-deterministic | nsplits[0]=1; dq_acc used for atomic add |
| Deterministic + group mode | `ceil((seqstart_ks[i+1]-seqstart_ks[i]) / kN0)` computed per batch |
| Deterministic + batch mode persistent | Same logic as the original `GetDqAccSplits` (`dqdqkdv_workers` based) |
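For the deterministic group-mode row, the per-batch computation can be sketched as below. The function name is hypothetical and `kN0` is passed as a runtime argument here only for illustration; in the source it is a template parameter of `PrepareWorkspaceHost`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// nsplits_i = ceil((seqstart_ks[i + 1] - seqstart_ks[i]) / kN0), one entry per batch.
// seqstart_ks has batch + 1 cumulative entries.
inline std::vector<std::int32_t> group_mode_nsplits(const std::vector<std::int32_t>& seqstart_ks,
                                                    std::int32_t kN0)
{
    assert(kN0 > 0);
    std::vector<std::int32_t> nsplits;
    for(std::size_t i = 0; i + 1 < seqstart_ks.size(); ++i)
    {
        const std::int32_t seqlen_k = seqstart_ks[i + 1] - seqstart_ks[i];
        nsplits.push_back((seqlen_k + kN0 - 1) / kN0); // integer ceiling division
    }
    return nsplits;
}
```

With key lengths 100, 512 and 1024 and `kN0 = 128`, the per-batch values are 1, 4 and 8, whereas a rectangular buffer sized from `max_seqlen_k` would reserve 8 splits for every batch.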
### Removing kN0 dependency from `FmhaBwdConvertQGradKernel`
`FmhaBwdConvertQGradKernel` previously required `kN0` as a template
parameter (via `BlockFmhaBwdConvertQGradPipelineProblem`) for two
purposes:
1. In batch mode `operator()`: self-computing `nsplits = ceil(seqlen_k /
kN0)`
2. The `b{kM0}x{kN0}` component of the kernel name string
Both have been removed in this refactor:
- **Batch mode**: now reads `kargs.nsplits_ptr[0]` directly (guarded by
`if constexpr(kIsDeterministic)` to avoid accessing a non-existent field
in non-deterministic instances)
- **Kernel name**: simplified to `b{kM0}`, no longer includes `kN0`
- **Template parameters**: `BlockFmhaBwdConvertQGradPipelineProblem`
drops the `kN0_` parameter; `fmha_bwd_convert_dq_traits_` drops the
`kN0` parameter; `F_bn0`/`convert_dq_bn0` fields removed from codegen
Effect: all dq_dk_dv tiles sharing the same `(hdim, dtype, mode, pad,
deterministic)` combination — regardless of `F_bn0` value
(16/64/128/192/256) — now share a **single** convert_dq kernel instance.
---
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
2a479f7411 |
[CK][CK_TILE] Fix FMHA codegen group mode dispatch (#6764)
## Motivation
FMHA codegen had incorrect dispatch behavior in group mode. Two root causes:
1. Wrong field names in dispatch conditions: the conditions used batch-mode fields (`seqlen_q`, `seqlen_k`) instead of group-mode fields (`max_seqlen_q`, `max_seqlen_k`), causing wrong kernel selection at runtime on gfx950.
2. Missing kernel variants: group mode was overly filtered out from smaller-tile specializations (bwd) and lacked spatial-padding pipeline variants on gfx950 (fwd). gfx942 does not support the trload pipeline.
## Technical Details
`fmha_bwd.py`:
- `max_seq_q_cond` and `extra_cond` now emit `t.max_seqlen_q` / `t.max_seqlen_k` for group mode.
- Relaxed kernel filtering: group mode no longer skips tiles with `max_seq_q != 0`.
`fmha_fwd.py`:
- `get_bm0_cond` emits `a.max_seqlen_q` for group mode tile-size dispatch.
- Added two `qr_async_trload` pipeline variants with spatial padding for gfx950 group mode.
## Test Plan
Triggering AITER CI job:
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
34084aeb11 |
[CK_TILE] fix(fmha): support >2GB KV cache in batch prefill via template dispatch (#6653)
## Motivation
The CK batch prefill kernel previously failed (silent overflow + page
faults) when the KV cache exceeded 2 GB, blocking long-context inference
workloads (e.g., 128K+ token contexts with paged KV).
Two distinct failure modes were addressed:
1. **>2 GB SRD overflow (`page_size < kN0`):** The SRD
`buffer_load_dwordx4` path uses a 32-bit `voffset` register; for small
page sizes the rebased SRD spans the full KV pool and the offset wraps
past 2 GB, corrupting K/V loads.
2. **gfx950 page-table fault (`page_size >= kN0`):** On CDNA4 the
hardware validates the **full SRD `num_records` range** against
page-table permissions (CDNA3 only checks per-instruction `voffset`).
After per-tile SRD rebase, an un-trimmed `num_records` field extends
past the live page and faults on freed/protected memory.
## Technical Details
**Two-mode `tile_scatter_gather` selected by the `kUseGlobalLoad`
template parameter:**
| Case | `page_size` | KV cache size | Mode | Load path | Addressing |
|---|---|---|---|---|---|
| 1 | `>= kN0` (large pages) | any | SRD (`kUseGlobalLoad=false`) | `buffer_load_dwordx4` | 32-bit `voffset`, bounded by per-page rebase |
| 2 | `< kN0` (small pages) | `<= 2 GB` | SRD (`kUseGlobalLoad=false`) | `buffer_load_dwordx4` | 32-bit `voffset`, fits in INT32 byte range |
| 3 | `< kN0` (small pages) | `> 2 GB` | Global-load (`kUseGlobalLoad=true`) | `async_load_tile_raw_flat` (K) + `load_tile_flat` (V) | 64-bit |
**Dispatch:** the auto-gen API layer (`fmha_batch_prefill.py`) selects
the kernel instantiation at launch from `(page_block_size,
num_total_pages * batch_stride_k * kElementBytes)`, so the small-page
penalty is paid only when correctness requires it.
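The selection rule implied by the table and the dispatch paragraph can be sketched as a host-side predicate. The function name is hypothetical, and the exact threshold (KV pool bytes exceeding `INT32_MAX`) is an assumption based on the "fits in INT32 byte range" wording; the generated API may express the condition differently.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Returns true when the 64-bit global-load path (kUseGlobalLoad=true) is needed:
// small pages (page_block_size < kN0) combined with a KV pool whose byte size
// no longer fits in a signed 32-bit offset.
inline bool use_global_load(std::int64_t page_block_size,
                            std::int64_t kN0,
                            std::int64_t num_total_pages,
                            std::int64_t batch_stride_k,
                            std::int64_t element_bytes)
{
    const std::int64_t kv_pool_bytes = num_total_pages * batch_stride_k * element_bytes;
    const bool small_pages           = page_block_size < kN0;
    const bool exceeds_int32 = kv_pool_bytes > std::numeric_limits<std::int32_t>::max();
    return small_pages && exceeds_int32;
}
```

Large pages never take the global-load path in this model, because the per-page SRD rebase keeps the 32-bit offset bounded regardless of pool size.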
**gfx950 SRD `num_records` trimming:** in the K and V rebase lambdas of
`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`,
`set_bottom_tensor_view_buffer_size(page_stride_k/v)` is called after
each rebase to constrain `num_records` to the live page. Required for
CDNA4 page-table validation; harmless on CDNA3.
**Pipeline sync for the global-load path:**
- V uses synchronous `load_tile_flat`; K uses
`async_load_tile_raw_flat`.
- `v_physical_pages_current` is double-buffered so the V flat load
doesn't race against the next iteration's K rebase computation.
**Arch guards:** `global_load_lds` intrinsics are gated to `__gfx94__` /
`__gfx950__` (CDNA3+). Other architectures hit a `dependent_false`
static_assert with a descriptive message.
**Device-side assertion convention:** SRD setters use
`__builtin_assume(cond)` (hint-only) rather than `<cassert>`'s
`assert()`. The latter introduces an `__assert_fail` call whose register
pressure scatters the K-SRD scalar register window across conditional
branches, corrupting `buffer_load_dwordx4` on gfx950.
## Test Plan
Tested on both MI308 (gfx942) and MI355 (gfx950) via the aiter wrapper
test suite. All coverage lives in **`op_tests/test_batch_prefill.py`**:
- **Functional matrix (96 cases)** — `test_batch_prefill`: `page_size ∈
{1, 16, 1024}` × `kv_layout ∈ {linear, vectorized}` × `dtype ∈ {bf16,
fp8 quant variants}` × `causal` × `soft_cap` × `LSE` × `batch_size ∈ {1,
4}` (parametrized to exercise per-sequence SRD rebase across batch
boundaries).
- **>2 GB coverage** — `test_batch_prefill_large_kvcache`: extended to
allocate a 5 GB+ KV cache pool and exercise both `kUseGlobalLoad=true`
(small-page) and `kUseGlobalLoad=false` (large-page rebase) paths.
Includes both single-batch and multi-batch (`batch_size=4`) cases to
exercise per-sequence SRD rebase across the >2 GB pool.
- Numerical reference: PyTorch SDPA, per-batch loop with `atol` / `rtol`
from the existing batch prefill test harness.
## Test Result
| Arch | `test_batch_prefill` | `test_batch_prefill_large_kvcache` (>2 GB) |
|------|----------------------|---------------------|
| MI308 (gfx942) | All passed | Passed |
| MI355 (gfx950) | All passed | Passed |
**Performance impact (gfx950, hot SRD path):**
- +2.67% kernel-time on `seqlen=1024 / page_sz=1024 / bf16 / sglang /
causal / soft_cap=30`, attributable in full to the two
`set_bottom_tensor_view_buffer_size` calls in the K/V rebase lambdas
(5-run median, signal/noise ≈ 9×).
- This cost is **mandatory for gfx950 correctness** on >2 GB workloads —
removing the setters re-introduces page-faults.
- gfx942: 0 regressions in the same range (all configs ≤ +0.97%).
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
803874c73b |
[CK][fmha] Add StreamLLM sink support to batch_prefill pipeline (#6479)
## Motivation
The existing paged-KV attention pipelines (pagedkv, splitkv) support
StreamLLM-style sink tokens — a fixed set of initial tokens kept in
attention alongside the sliding window. The `batch_prefill` pipeline
(chunked-prefill with VLLM-style block tables) previously hardcoded
`kHasSink = false`, making it incompatible with sink-based attention
patterns in LLM serving scenarios.
This PR extends `batch_prefill` to support `kHasSink` and wires it
into `fmha_fwd_runner` for validation against the existing CPU
reference.
## Technical Details
**Pipeline** (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`):
- When `kHasSink`, the K/V loop splits into a sink phase [0, sink_seq_end)
and a window phase [seqlen_k_start, seqlen_k_end), mirroring pagedkv.
- K advance at the sink→window transition jumps
`seqlen_k_start - sink_seq_end + kN0` to bridge the gap.
- V scatter-gather offsets are re-initialized at the transition to fix a
window mismatch bug: V was lagging kN0 behind K after the large jump,
loading from the wrong sequence position.
- Bias window, dropout seq_offset, and mask type (LogitsSinkMask) updated
for sink-awareness.
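The two-phase K traversal described above can be modeled by listing the tile start positions that the loop visits. This is a simplified host-side sketch, not the pipeline code: the function name is hypothetical, and it assumes `sink_seq_end` is a multiple of `kN0` and `seqlen_k_start >= sink_seq_end`.

```cpp
#include <cassert>
#include <vector>

// Returns the K tile start positions visited by a sink phase over
// [0, sink_seq_end) followed by a window phase over [seqlen_k_start, seqlen_k_end).
inline std::vector<int> k_tile_starts(int sink_seq_end, int seqlen_k_start, int seqlen_k_end, int kN0)
{
    assert(kN0 > 0 && sink_seq_end % kN0 == 0 && seqlen_k_start >= sink_seq_end);
    std::vector<int> starts;
    int pos = (sink_seq_end > 0) ? 0 : seqlen_k_start;
    // Sink phase.
    while(pos < sink_seq_end)
    {
        starts.push_back(pos);
        const bool last_sink_tile = (pos + kN0 == sink_seq_end);
        // At the sink -> window transition the advance is
        // seqlen_k_start - sink_seq_end + kN0, which lands exactly on seqlen_k_start.
        pos += last_sink_tile ? (seqlen_k_start - sink_seq_end + kN0) : kN0;
    }
    // Window phase.
    for(; pos < seqlen_k_end; pos += kN0)
        starts.push_back(pos);
    return starts;
}
```

With `kN0 = 128`, one sink tile and a window of [512, 768), the visited tiles start at 0, 512 and 640. V has to follow the same jump, which is the reason the V offsets are re-initialized at the transition.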
**Traits / codegen** (`tile_fmha_traits.hpp`, `fmha_fwd.hpp`,
`fmha_batch_prefill.py`):
- `TileFmhaBatchPrefillTraits` gains `kHasSink_` (was hardcoded
`false`).
- Codegen adds `F_sink` field; skips batch-mode kernels (group mode
required).
- CMake test filter broadened from 9 → 33 instances covering
fp16/bf16 × mask/nmask × lse/nlse × sink/nsink.
**Runner** (`fmha_fwd_runner.hpp`, `CMakeLists.txt`):
- `fmha_batch_prefill()` dispatched from `run_fwd` when:
group mode + paged KV + num_splits == 1.
- K/V strides corrected for runner's [num_pages, nhead_k,
page_block_size, hdim] layout.
- `page_block_size % 128` check relaxed: batch_prefill supports ps=16.
- CPU reference paged-KV reordering guards extended with
`CK_TILE_FMHA_FWD_BATCH_PREFILL_API`.
## Test Plan
Build with `-DFMHA_FWD_ENABLE_APIS="fwd;batch_prefill"`, run
`tile_example_fmha_fwd` in group mode with page_block_size=16.
Test matrix:
- Mask: no-mask, causal, sliding window
- Sink: nsink, sink=1..128
- dtype: fp16, bf16
- LSE output: on/off
- seqlen ∈ {512,1024,2048,4096} × window ∈ {32,256,512,1024}
- GQA, chunked prefill, large batch×seqlen
- page_block_size: 16, 32
## Test Result
171 test cases, all valid:y:
- nmask + nsink: ✓
- causal + nsink: ✓
- causal + sink=8: ✓
- sliding window + sink=8 (d=128, d=256): ✓
- bf16, LSE output, GQA: ✓
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
2574f37483 |
[CK_TILE] Enable canonical-NaN BF16 conversion for FMHA on RDNA (#6253)
## Motivation
- On gfx11/gfx12, the existing float -> bf16 conversion path in FMHA
forward adds noticeable overhead and causes a meaningful performance gap
versus fp16. The asm-based path (mode 3) does not improve this on RDNA
and can perform even worse.
- In particular, on gfx12, bf16 FMHA forward can be up to ~20% slower
than the corresponding fp16 path.
- This PR reduces that gap by switching FMHA forward to a different BF16
conversion strategy based on Triton’s canonical-NaN
round-to-nearest-even behavior.
## Technical Details
- Add a new `standard_cnan` BF16 conversion mode to CK Tile.
- Implement a canonical-NaN RTN `float -> bf16` conversion path based on
the Triton implementation.
- Enable this conversion mode by default for FMHA forward builds
targeting gfx11/gfx12.
- Retune gfx11/gfx12 FMHA forward kernel selection thresholds for some
`hdim=128` cases to keep kernel selection aligned with the updated
conversion behavior.
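A software model of a canonical-NaN, round-to-nearest-even `float -> bf16` conversion is sketched below. It illustrates the general technique only: the function names are hypothetical, the canonical NaN bit pattern (`0x7FC0`) is an assumption, and neither the CK Tile `standard_cnan` code nor the Triton implementation is reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Assumed canonical NaN pattern for this sketch (quiet NaN, sign bit clear).
constexpr std::uint16_t kCanonicalNaNBf16 = 0x7FC0;

// Reinterprets a 32-bit pattern as a float (helper for building test inputs).
inline float float_from_bits(std::uint32_t bits)
{
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

inline std::uint16_t float_to_bf16_rtne_cnan(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    // NaN: exponent all ones and a non-zero mantissa. Every NaN input maps
    // to one canonical output instead of preserving its payload.
    if((bits & 0x7FFFFFFFu) > 0x7F800000u)
        return kCanonicalNaNBf16;
    // Round to nearest, ties to even: add 0x7FFF plus the lowest kept bit,
    // then drop the low 16 bits.
    const std::uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7FFFu + lsb;
    return static_cast<std::uint16_t>(bits >> 16);
}
```

Collapsing all NaNs to one pattern removes the payload-preserving branch work from the conversion, which is one plausible reason such a path is cheaper than a NaN-preserving one.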
## Test Plan
`./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={hdim} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}`
## Test Result
- all tests passed when running `test_ck_tile_fmha`
- BF16 FMHA forward performance improves by up to ~5% on gfx11.
- BF16 FMHA forward performance improves by up to ~10% on gfx12.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
f5e00ec904 |
[CK_TILE] Skip padded k/n fragment work in qr_hpad FMHA fwd (#6450)
## Motivation
`qr_hpad` currently executes work for padded head-dim fragments even
when only a subset of the values are valid. This adds unnecessary
computation for head dimensions that require padding, such as `hdim=72`
and `hdim=80`, and hurts FMHA forward performance.
The goal of this PR is to make the padded-head-dim path skip invalid
work based on the actual valid fragment count, while preserving the
existing behavior for the non-padded path.
## Technical Details
This PR improves the `qr_hpad` FMHA forward path in three parts:
- Skip padded `k`/`n` fragments in the GEMM/pipeline path when only part
of the fragment is valid.
- Add partial GEMM0 tail handling for `qr_hpad` so the kernel uses the
valid fragment range instead of always computing over the padded extent.
- Retune the gfx11 `qr_hpad` kernel configuration after enabling the
partial-fragment path.
To keep the existing path stable, the implementation adds overloads for
the updated GEMM/pipeline interfaces. This allows existing full-tile
callers to keep using the previous form, while the `qr_hpad` path can
pass valid fragment counts when needed.
## Test Plan
`./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}`
## Test Result
- On gfx11 and gfx12, for head dimensions that require padding,
`tile_example_fmha_fwd` shows about 20-30% performance improvement at
`hdim=72/80`.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
65ad35becd |
[CK_TILE] Optimize FMHA head-dim padded path on gfx11/gfx12 (#6156)
## Motivation
On gfx11/gfx12, FMHA forward kernels that require head-dim padding show
a large performance drop compared to the exact-head-dim path. In
practice, padded cases such as `HDIM=72` and `HDIM=80` were falling too
far off the fast path.
This PR improves padded-head-dim FMHA performance on gfx11/gfx12 while
keeping the behavior for other GPUs unchanged.
## Technical Details
- Add/scope a dedicated padded-head-dim (`qr_hpad`) FMHA forward path
for gfx11/gfx12.
- For `receipt=0`, keep support conservative and only enable the padded
fast path for vector-safe cases (`head_dim % 8 == 0`), matching the
existing assumption used on other GPUs.
- Move `v_prefetch` later only for the head-dim-padded path on
gfx11/gfx12. This reduces live ranges and removes the register-spill
behavior seen in the earlier scheduling.
- Enable the buffer-load OOB check offset trick for the padded path on
gfx11/gfx12.
## Test Plan
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
## Test Result
Observed padded-head-dim performance improvements for HDIM=72/80:
- gfx11: about 3.5x
- gfx1151: about 2.0x
- gfx12: about 1.3x
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
6dc44114ba |
[CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051)
## Motivation
The existing FMHA V3 pipeline only supports fp16/bf16 data types. This PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950, enabling higher throughput for FP8 inference workloads using the assembly-optimized V3 code path.
## Technical Details
**Warp GEMM:**
- Add FP8 32x32x32 warp gemm with C-transposed distribution (`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries
**V3 Kernel (`fmha_fwd_v3_kernel.hpp`):**
- Add per-tensor descale support for Q, K, V tensors, passing descale pointers through to pipeline kargs
**V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`):**
- Add FP8 data path with dtype-aware type selection
- Add asm volatile P matrix conversion from f32 to fp8
- Add FP8-aware instruction scheduling in `CoreLoopScheduler`
**V3 Pipeline Policy (`block_fmha_fwd_v3_pipeline_default_policy.hpp`):**
- Add FP8 QK warp gemm selection (SwizzleB variant for V tile distribution compatibility)
**Codegen (`fmha_fwd.py`):**
- Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128)
- Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale: no/pertensor)
- Extend `can_dispatch_v3` condition for fp8bf16 + pertensor
**Misc:**
- Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum (`arch.hpp`)
- Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`)
V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`) pending further validation.
## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950)
| Problem | Regular (TFlops) | Varlen (TFlops) |
|---|---:|---:|
| batch=1 heads=6/1 seqlen=1024 causal | 48.9 | 47.6 |
| batch=1 heads=6/1 seqlen=2048 causal | 119.8 | 117.4 |
| batch=1 heads=6/1 seqlen=4096 causal | 263.7 | 259.2 |
| batch=1 heads=6/1 seqlen=8192 causal | 548.9 | 543.6 |
| batch=1 heads=6/1 seqlen=16384 causal | 1043.0 | 1063.7 |
| batch=1 heads=6/1 seqlen=32768 causal | 1237.2 | 1279.6 |
| batch=1 heads=6/1 seqlen=65536 causal | 1315.4 | 1382.7 |
| batch=1 heads=6/1 seqlen=131072 causal | 1326.3 | 1402.2 |
| batch=1 heads=16/1 seqlen=65536 causal | 1298.7 | 1388.4 |
| batch=1 heads=40/40 seqlen=37200 non-causal | 1248.9 | 1326.1 |
## Test Plan
Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40), GQA ratios (1:1, 1:8), and causal/non-causal modes. Verified all cases dispatch to the V3 pipeline by enabling `F_is_v3_enabled` and confirming kernel names contain `qr_async_trload_v3`.
## Test Result
176/176 tests passed with V3 enabled. All cases correctly dispatched to V3 pipeline with `pertensor` quantization.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
5370485459 |
[CK_TILE] Restrict FMHA codegen to the kernel subset used by FlashAttention (#6038)
## Motivation
Currently, the CK FlashAttention integration generates a broader FMHA kernel set than the FlashAttention wrappers can actually dispatch, which increases compile time without improving runtime coverage.
## Technical Details
The FlashAttention CK wrappers do not use all logits/LSE variants emitted by the default FMHA codegen. The direct `fmha_fwd` path always uses softcap-disabled, LSE-enabled kernels, and the `fmha_fwd_splitkv` path only uses softcap-disabled kernels. This change trims codegen to that subset and stops generating the unused logits/LSE variants. This reduces the generated forward kernel set without changing `fmha_fwd_appendkv` or `fmha_bwd`.
The reduced kernel set was validated by building and running the [FlashAttention](https://github.com/Dao-AILab/flash-attention) CK backend.
Across targets, the total generated FMHA kernel count is reduced by:
- `gfx942`: 29.3%
- `gfx1100`: 33.7%
- `gfx1201`: 31.3%
## Test Plan
`pytest test/test_flash_attn_ck.py` from https://github.com/Dao-AILab/flash-attention
## Test Result
All tests passed.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
ba0efe01af |
[CK Tile] Add sink token gradient support in FMHA backward pass (#5504)
## Motivation
Adds sink token support to the FMHA backward kernel (dot_do_o pipeline).
## Technical Details
- Extend `BlockFmhaBwdOGradDotOPipelineProblem` with `LSEDataType`
- Add `sink_ptr`/`d_sink_ptr`/`lse_ptr`/`nhead` to `FmhaBwdOGradDotOCommonKargs`
- Compute per-head sink gradient via atomic accumulation in the pipeline
- Update example runner with reference validation for sink gradient
## Test Plan
Add new test case
## Test Result
WIP
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
47fb489d78 | [CK_TILE] Update gfx12 FMHA forward kernel configs (#5798) | ||
|
|
fa912ed457 |
[CK][CK_TILE] Add fp8bf16 hdim=256 tile for batch prefill (#5918)
## Motivation
FP8 batch prefill kernels currently only support head_dim=128. Models
with head_dim=256 hit the "invalid argument for batch_prefill" error
because no matching kernel variant exists in the codegen dispatch.
## Technical Details
Add a hdim=256 tile size entry for fp8bf16 in the batch prefill codegen
recipe (`fmha_batch_prefill.py`).
Tile configuration: `FmhaFwdTileSize(128, 128, 32, 256, 32, 256, 4,1,1,
4,1,1, 32,32,32, 32,32,32, -1)`
- bm0=128, bn0=128 (Q/K tile sizes)
- bn1=256, bk0max=256 (V head_dim=256)
- Warp MFMA 32x32x32 (fp8 MFMA instructions)
This mirrors the existing bf16/fp16 hdim=256 tile but uses fp8 warp
sizes.
## Test Plan
Tested on both MI308X (gfx942) and MI355X (gfx950) via aiter batch
prefill test with the following matrix:
- page_size: {1, 16, 1024}
- kv_layout: {linear, vectorized}
- lookup_table: {sglang, vllm}
- causal: {true, false}
- logits_soft_cap: {0.0, 30.0}
- contiguous_kv: {true, false}
## Test Result
**MI308X (gfx942):** 160 passed, 32 skipped (page_size=1 + vectorized
not applicable)
**MI355X (gfx950):** 120 passed, 72 skipped (pre-existing ROCm 7.2
compiler issue with causal + no softcap)
No register spills on either platform.
### Profiling — MI355X (gfx950), FP8 pertensor, hdim=256, seqlen=1024, 8
heads
| page_sz | kv_layout | table | causal | soft_cap | time_us | TFLOPS |
|---------|-----------|-------|--------|----------|---------|--------|
| 1 | linear | sglang | False | 0.00 | 55.01 | 156.16 |
| 1 | linear | vllm | False | 0.00 | 55.12 | 155.84 |
| 1 | linear | sglang | False | 30.00 | 62.63 | 137.16 |
| 1 | linear | vllm | False | 30.00 | 62.16 | 138.20 |
| 1 | linear | sglang | True | 30.00 | 64.09 | 67.01 |
| 1 | linear | vllm | True | 30.00 | 63.85 | 67.27 |
| 16 | linear | sglang | False | 0.00 | 57.00 | 150.69 |
| 16 | vectorized | sglang | False | 0.00 | 57.55 | 149.25 |
| 16 | linear | vllm | False | 0.00 | 56.80 | 151.23 |
| 16 | vectorized | vllm | False | 0.00 | 57.32 | 149.87 |
| 16 | linear | sglang | False | 30.00 | 64.77 | 132.62 |
| 16 | vectorized | vllm | False | 30.00 | 63.54 | 135.18 |
| 16 | linear | sglang | True | 30.00 | 66.84 | 64.26 |
| 16 | vectorized | vllm | True | 30.00 | 66.12 | 64.96 |
| 1024 | linear | sglang | False | 0.00 | 58.25 | 147.46 |
| 1024 | vectorized | sglang | False | 0.00 | 57.53 | 149.31 |
| 1024 | linear | vllm | False | 0.00 | 58.06 | 147.94 |
| 1024 | vectorized | vllm | False | 0.00 | 57.55 | 149.27 |
| 1024 | linear | sglang | False | 30.00 | 65.38 | 131.38 |
| 1024 | vectorized | vllm | False | 30.00 | 63.64 | 134.98 |
| 1024 | linear | sglang | True | 30.00 | 66.85 | 64.25 |
| 1024 | vectorized | vllm | True | 30.00 | 65.26 | 65.81 |
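The TFLOPS column appears consistent with the usual attention FLOP count of 4 x batch x heads x seqlen_q x seqlen_k x hdim, halved for causal rows. The sketch below reproduces two rows under that assumption. Batch size 1 is also an assumption (the table heading does not state it), and `attention_tflops` is a hypothetical helper, not part of the benchmark harness.

```python
def attention_tflops(batch, heads, seqlen_q, seqlen_k, hdim, time_us, causal=False):
    # Two GEMMs (Q*K^T and P*V), each costing 2*M*N*K FLOPs, give the factor 4.
    flops = 4 * batch * heads * seqlen_q * seqlen_k * hdim
    if causal:
        flops //= 2  # assumption: about half of the score matrix is masked out
    return flops / (time_us * 1e-6) / 1e12


# page_sz=1, linear, sglang, non-causal, soft_cap=0.00 (table: 156.16 TFLOPS)
print(round(attention_tflops(1, 8, 1024, 1024, 256, 55.01), 2))
# page_sz=1, linear, sglang, causal, soft_cap=30.00 (table: 67.01 TFLOPS)
print(round(attention_tflops(1, 8, 1024, 1024, 256, 64.09, causal=True), 2))
```

Small differences in the last digit are expected because the reported times are themselves rounded.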
### Profiling — MI308X (gfx942), FP8 pertensor, hdim=256, seqlen=1024, 8
heads
| page_sz | kv_layout | table | causal | soft_cap | time_us | TFLOPS |
|---------|-----------|-------|--------|----------|---------|--------|
| 1 | linear | sglang | False | 0.00 | 110.18 | 77.96 |
| 1 | linear | vllm | True | 30.00 | 134.33 | 31.97 |
| 1 | linear | sglang | True | 30.00 | 134.59 | 31.91 |
| 16 | linear | sglang | False | 0.00 | 115.43 | 74.42 |
| 16 | vectorized | sglang | False | 0.00 | 106.11 | 80.95 |
| 16 | linear | vllm | False | 0.00 | 116.34 | 73.83 |
| 16 | vectorized | vllm | False | 0.00 | 106.17 | 80.91 |
| 16 | linear | sglang | False | 30.00 | 135.61 | 63.34 |
| 16 | vectorized | vllm | False | 30.00 | 122.37 | 70.20 |
| 16 | linear | sglang | True | 0.00 | 117.44 | 36.57 |
| 16 | vectorized | vllm | True | 0.00 | 108.81 | 39.47 |
| 16 | linear | sglang | True | 30.00 | 139.43 | 30.80 |
| 16 | vectorized | vllm | True | 30.00 | 125.87 | 34.12 |
| 1024 | linear | sglang | False | 0.00 | 110.65 | 77.63 |
| 1024 | vectorized | sglang | False | 0.00 | 101.70 | 84.46 |
| 1024 | linear | vllm | False | 0.00 | 111.71 | 76.89 |
| 1024 | vectorized | vllm | False | 0.00 | 101.55 | 84.59 |
| 1024 | linear | sglang | False | 30.00 | 129.33 | 66.42 |
| 1024 | vectorized | vllm | False | 30.00 | 120.95 | 71.02 |
| 1024 | linear | sglang | True | 0.00 | 112.26 | 38.26 |
| 1024 | vectorized | vllm | True | 0.00 | 103.02 | 41.69 |
| 1024 | linear | sglang | True | 30.00 | 133.73 | 32.12 |
| 1024 | vectorized | vllm | True | 30.00 | 124.75 | 34.43 |
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
b5894b3cbe |
[CK_TILE] Add LLC-aware FMHA head grouping and head-major scheduling on RDNA (#5018)
## Motivation
Long-sequence FMHA can become memory-bound when K/V working sets exceed Infinity Cache (LLC), causing repeated DRAM traffic across heads.
This PR introduces LLC-aware launch ordering improvements for FMHA forward, and it is currently enabled only on gfx11 and gfx12. The approach is inspired by [`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217), adapted to CK’s kernel/runner structure and layout handling.
In this context, `bshd` is the layout used in Flash-Attention, while `bhsd` is the default layout used by the CK Tile FMHA example.
## Technical Details
This PR adds two complementary strategies:
- For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware head grouping:
  - Estimate LLC size (env override, KFD sysfs, or arch default).
  - Compute group size from K/V bytes per head vs LLC target.
  - Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and related tensors).
- For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit launch-order adjustment:
  - Keep a single kernel launch.
  - Reinterpret block linearization in `GetTileIndex` to make execution head-major, improving temporal locality of per-head K/V reuse.
Additional integration updates:
- Propagate `num_head_q_total` and `head_start` through FMHA args/kargs.
- Use global head indexing for dropout RNG stream mapping so grouped launches keep deterministic/consistent dropout behavior.
- Keep fallback behavior unchanged when grouping is not beneficial or disabled.
## Test Plan
- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`
## Test Result
- `test_ck_tile_fmha`: all tests passed.
- `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201, and all of them show higher performance compared to the baseline. The improvement is consistent, and performance is well maintained even at long sequence lengths.
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
- TFLOPs by sequence length target: gfx1100 layout: bhsd
| SeqLen | Before | After | Speedup |
| -- | -- | -- | -- |
| 1024 | 56.27 | 61.48 | 1.09x |
| 4096 | 67.10 | 72.27 | 1.08x |
| 8192 | 65.99 | 71.64 | 1.09x |
| 12288 | 61.60 | 76.61 | 1.24x |
| 16384 | 58.99 | 75.74 | 1.28x |
| 20480 | 57.32 | 74.42 | 1.30x |
| 24576 | 56.89 | 74.25 | 1.31x |
| 27280 | 18.93 | 24.48 | 1.29x |
- TFLOPs by sequence length target: gfx1201 layout: bshd
| SeqLen | Before | After | Speedup |
| -- | -- | -- | -- |
| 1024 | 66.79 | 65.90 | 0.99x |
| 4096 | 85.90 | 86.80 | 1.01x |
| 8192 | 77.06 | 90.29 | 1.17x |
| 12288 | 58.36 | 88.98 | 1.52x |
| 16384 | 52.12 | 88.88 | 1.71x |
| 20480 | 48.11 | 88.42 | 1.84x |
| 24576 | 47.12 | 89.07 | 1.89x |
| 27280 | 49.05 | 50.31 | 1.03x |
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
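A rough sketch of the explicit head-grouping calculation described in this entry. The PR states only that the group size is derived from K/V bytes per head versus an LLC target. The function names, the 0.5 utilization fraction, and the 64 MiB LLC size below are illustrative assumptions, not the runner's actual values.

```python
def head_group_size(num_heads, seqlen_k, hdim, bytes_per_elem, llc_bytes, llc_fraction=0.5):
    # K and V for one head: two tensors of seqlen_k x hdim elements.
    kv_bytes_per_head = 2 * seqlen_k * hdim * bytes_per_elem
    budget = int(llc_bytes * llc_fraction)  # assumed share of the LLC to target
    # At least one head per launch, at most all heads in a single launch.
    return max(1, min(num_heads, budget // kv_bytes_per_head))


def head_groups(num_heads, group_size):
    # One (head_start, head_count) pair per launch, covering all heads in order.
    return [(start, min(group_size, num_heads - start))
            for start in range(0, num_heads, group_size)]


size = head_group_size(num_heads=24, seqlen_k=16384, hdim=128,
                       bytes_per_elem=2, llc_bytes=64 * 1024 * 1024)
print(size, head_groups(24, size))
```

When all heads fit within the budget the result is a single group, which corresponds to the unchanged fallback of one launch.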
||
|
|
f6bfcad437 |
[CK_TILE] FMHA BWD Use Persistent Kernels in Deterministic Mode (#5174)
## Motivation
This PR enables a persistent-kernel execution path for FMHA backward (dQ/dK/dV) in deterministic mode, adjusting how dQ accumulation is split, stored, and converted back to final gradients.
## Technical Details
- Introduces a persistent-kernel grid mapping in deterministic mode and updates split-count calculation accordingly.
- Extends kernel kargs to carry batch-related info needed for persistent scheduling and dQ conversion.
- Refactors dQ store conditions and adds mask-type traits/utilities and runner logging updates.
## Test Plan
- Jenkins [base](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/10/pipeline)
- Jenkins [AITER](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/12/pipeline)
- Jenkins [FMHA](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/11/pipeline)
- local FA tests
## Test Result
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
25d9fdfc16 |
[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368)
## Motivation
Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline
## Technical Details
The microscaling is used when quant scale mode is `BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are fp8/bf8/fp4.
Supported features:
* only "qr" pipeline is implemented
* hdim 128 and 256 (smaller hdim are not possible due to restrictions of "qr" pipeline, but they can be computed using instances with padding)
* both 32x32x64 and 16x16x128 scale MFMAs are supported
* Q and K scales are applied in hdim, V scales - in seqlen dimension
* column-major V only
* batch and group mode
* bias, Alibi (tested but no instances by default, just like fp8)
* masking etc.
Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008
## Test Plan
```
ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```
## Test Result
The tests must pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
559ad6f0b1 |
[CK_TILE] Update gfx11 FMHA forward kernel configs (#5088)
## Motivation
Tune gfx11 FMHA codegen to recover performance for mainly PSSK (padded seqlen_q/k) cases. This tuning is based on heuristic search and improves performance in most tested shapes.
Performance should be evaluated on top of [`ROCm/rocm-libraries#5018`](https://github.com/ROCm/rocm-libraries/pull/5018) (required baseline).
## Technical Details
- Updated gfx11 codegen heuristic choices for tile size and occupancy.
- Updated gfx11 pipeline selection:
  - Disabled the `npad` (`f,f,f,f`) qr entry because it was consistently slower than the `pssk` (`t,t,f,f`) path, and kept `pssk` enabled so npad cases are dispatched to the faster kernel path.
- Kept gfx12 unchanged: with PSSK support from [`ROCm/rocm-libraries#4957`](https://github.com/ROCm/rocm-libraries/pull/4957), existing gfx12 config is already sufficient.
- Tuning rationale:
  - In some cases, higher `kBlockPerCu` lowers register pressure.
  - On RDNA, this generally aligns with better performance when `waves_per_eu >= 6`.
## Test Plan
- test_ck_tile_fmha
- tile_example_fmha_fwd: tested this on gfx1100 and gfx1151
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
## Test Result
- TFLOPs by sequence length target: `gfx1100` layout: `bhsd`
- mode: batch / VGPR usage: 225 vs 214
| SeqLen | Baseline | Tuned | Gain |
| -- | -- | -- | -- |
| 1024 | 74.10 | 71.97 | 0.97x |
| 4096 | 66.26 | 77.79 | 1.17x |
| 8192 | 68.18 | 75.88 | 1.11x |
| 12288 | 68.47 | 80.44 | 1.17x |
| 16384 | 59.54 | 79.66 | 1.34x |
| 20480 | 55.78 | 77.91 | 1.40x |
| 24576 | 55.08 | 77.47 | 1.41x |
| 27280 | 47.45 | 77.16 | 1.63x |
- mode: group / VGPR usage: 256 vs 214
| SeqLen | Baseline | Tuned | Gain |
| -- | -- | -- | -- |
| 1024 | 71.47 | 70.6 | 0.99x |
| 4096 | 64.74 | 77.06 | 1.19x |
| 8192 | 64.68 | 75.47 | 1.17x |
| 12288 | 66.43 | 79.95 | 1.20x |
| 16384 | 56.02 | 79.73 | 1.42x |
| 20480 | 50.21 | 78.15 | 1.56x |
| 24576 | 47.29 | 77.53 | 1.64x |
| 27280 | 46.13 | 77.04 | 1.67x |
- TFLOPs by sequence length target: `gfx1151` layout: `bshd`
- mode: batch / VGPR usage: 225 vs 223
| SeqLen | Baseline | Tuned | Gain |
| -- | -- | -- | -- |
| 1024 | 26.85 | 29.17 | 1.09x |
| 4096 | 24.75 | 26.01 | 1.05x |
| 8192 | 25.24 | 25.50 | 1.01x |
| 12288 | 25.18 | 25.00 | 0.99x |
| 16384 | 24.79 | 25.91 | 1.05x |
| 20480 | 25.56 | 25.24 | 0.99x |
| 24576 | 25.13 | 26.20 | 1.04x |
| 27280 | 10.78 | 26.35 | 2.44x |
- mode: group / VGPR usage: 256 vs 229
| SeqLen | Baseline | Tuned | Gain |
| -- | -- | -- | -- |
| 1024 | 27.44 | 26.71 | 0.97x |
| 4096 | 21.89 | 23.09 | 1.05x |
| 8192 | 22.85 | 24.49 | 1.07x |
| 12288 | 24.33 | 24.42 | 1.00x |
| 16384 | 20.05 | 24.98 | 1.24x |
| 20480 | 14.70 | 25.15 | 1.71x |
| 24576 | 11.30 | 26.31 | 2.33x |
| 27280 | 10.10 | 26.32 | 2.61x |
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
4a6638adcf |
[CK_TILE][FMHA] Extend pipelines with pssk for gfx11/12 (#4957)
## Motivation
Build pipelines with seqlen padding only to support vectorized loads in the hdim dimension. The existing pipelines have either all dims padded or all dims not padded.
These pipelines can be used in ComfyUI for slightly better performance.
## Technical Details
Also includes a fix for correct FLOPS calculation in `tile_example_fmha_fwd` when `seqlen_q * seqlen_k` overflows index_t capacity (signed int32).
## Test Plan
The existing test cases will use the new pipelines when parameters allow (seqlens - padded, hdims - not padded):
```
ninja test_ck_tile_fmha_fwd
bin/test_ck_tile_fmha_fwd_fp16
bin/test_ck_tile_fmha_fwd_bf16
bin/test_ck_tile_fmha_fwd_fp8bf16 # for gfx12
```
## Test Result
All tests must pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
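To illustrate the overflow mentioned in this entry, the sketch below emulates a signed 32-bit product in Python using `ctypes.c_int32`. The FLOP formula and the sample shape are assumptions for illustration, not the exact code in `tile_example_fmha_fwd`. In C++ a signed overflow is undefined behavior, so the wraparound shown here is only the typical outcome.

```python
import ctypes


def wrap_int32(x: int) -> int:
    # Emulate storing x in a signed 32-bit integer (two's-complement truncation).
    return ctypes.c_int32(x).value


seqlen_q = seqlen_k = 65536  # assumed long-sequence shape
hdim = 128                   # assumed head dimension

wrapped = wrap_int32(seqlen_q * seqlen_k)  # product truncated to int32
exact = seqlen_q * seqlen_k                # Python ints do not overflow

flops_wrong = 4 * wrapped * hdim  # FLOP count built on the truncated product
flops_right = 4 * exact * hdim    # FLOP count built on the exact product
print(wrapped, exact, flops_wrong, flops_right)
```

Widening one operand to a 64-bit type before multiplying is the usual way to avoid this kind of truncation.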
||
|
|
e454358113 |
[CK_TILE] FMHA BWD Launcher Interface (#4577)
## Motivation
Reduce memory usage and prepare for optimizations that reduce nsplits in deterministic cases.
## Technical Details
This PR introduces a new launcher interface for the FMHA backward operation, replacing direct function calls with a more structured approach. The launcher encapsulates kernel dispatch logic and provides access to computed metadata like the number of dQ acc splits.
**Changes:**
- Added `fmha_bwd_launcher` class that wraps kernel execution and exposes `dq_acc_splits`
- Moved `fmha_bwd_traits` construction earlier in the execution flow to support launcher initialization
- Refactored code generation to produce both legacy API and new launcher constructor
## Test Plan
## Test Result
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
5cc04d7676 |
[CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045)
## Motivation
Still addressing errors that are blocking the merge of TheRock PR: https://github.com/ROCm/TheRock/actions/runs/22545831304/job/65308264096?pr=3382
## Technical Details
1. There are multiple fmha python scripts that are writing native paths which are confusing cmake. I addressed one of these in an earlier PR https://github.com/ROCm/rocm-libraries/pull/4812 and now I'm addressing more that are exposed with gfx1200 target:
```
[composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library):
[composable_kernel configure] Syntax error in cmake code when parsing string
[composable_kernel configure]
[composable_kernel configure] B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp
[composable_kernel configure]
[composable_kernel configure] Invalid character escape '\b'.
```
2. In the following compiler error we see gemm_prec_str<ADataType, BDataType> being passed as a function to concat(...), instead of being evaluated with the call operator (), i.e., gemm_prec_str<ADataType, BDataType>(). There are multiple instances of this, and I wonder what non-msvc compilers do here:
```
[composable_kernel] FAILED: [code=1] example/ck_tile/38_block_scale_gemm/CMakeFiles/tile_example_gemm_quant.dir/gemm_bquant_quantgrouped_mx_bf16bf8.cpp.obj
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_mx_bf16bf8.cpp:4:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm\run_gemm_quant_example.inc:17:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:7:
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/concat.hpp:119:21: error: implicit conversion between pointer-to-function and pointer-to-object is a Microsoft extension [-Werror,-Wmicrosoft-cast]
[composable_kernel] 119 | ((oss << sep << rest), ...);
[composable_kernel] | ^~~~
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/ops/gemm_quant/kernel/gemm_quant_kernel.hpp:248:16: note: in instantiation of function template specialization 'ck_tile::concat<char, char[11], std::basic_string<char> (), std::basic_string<char>>' requested here
[composable_kernel] 248 | return concat('_', "gemm_quant", gemm_prec_str<ADataType, BDataType>, GemmPipeline::GetName());
[composable_kernel] | ^
```
There are plenty of other places where we use gemm_prec_str with the call operator (), so I'm pretty sure these were just typos, but I'd like some eyes on it.
3. There are 2 tests that fail to build on Windows, which I've excluded from the build but will open bug tickets for:
1. gemm_weight_preshuffle
2.
grouped_gemm_preshuffle Here's a sample of the compiler error for these tests: ``` [composable_kernel] [16/19] Building HIP object test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] FAILED: [code=1] test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] E:\TheRock\build\core\clr\dist\lib\llvm\bin\clang++.exe -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_OCP_FP8 -DCK_USE_WMMA -DCK_USE_WMMA_FP8 -DCK_USE_XDL -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -IE:/TheRock/rocm-libraries/projects/composablekernel/profiler/include -IE:/TheRock/rocm-libraries/projects/composablekernel -IE:/TheRock/rocm-libraries/projects/composablekernel/library/include -IE:/TheRock/rocm-libraries/projects/composablekernel/include -IE:/TheRock/build/ml-libs/composable_kernel/build/include -IE:/TheRock/build/base/half/stage/include -isystem E:/TheRock/build/core/clr/dist/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/getopt-src/src -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx1200 -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point 
-Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -DCK_TILE_USE_OCP_FP8 -MD -MT test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -MF test\ck_tile\grouped_gemm_preshuffle\CMakeFiles\test_ck_tile_grouped_gemm_preshuffle.dir\test_grouped_gemm_preshuffle.cpp.obj.d -o test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -x hip -c E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp:8: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:6: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/check_err.hpp:16: [composable_kernel] In file included from 
E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core.hpp:89: [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core/utility/env.hpp:110:31: warning: 'getenv' is deprecated: This function or variable may be unsafe. Consider using _dupenv_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [-Wdeprecated-declarations] [composable_kernel] 110 | const char* vp = std::getenv(name); [composable_kernel] | ^ [composable_kernel] C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h:1183:20: note: 'getenv' has been explicitly marked deprecated here [composable_kernel] 1183 | _Check_return_ _CRT_INSECURE_DEPRECATE(_dupenv_s) [composable_kernel] | ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:368:55: note: expanded from macro '_CRT_INSECURE_DEPRECATE' [composable_kernel] 368 | #define _CRT_INSECURE_DEPRECATE(_Replacement) _CRT_DEPRECATE_TEXT( \ [composable_kernel] | ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:358:47: note: expanded from macro '_CRT_DEPRECATE_TEXT' [composable_kernel] 358 | #define _CRT_DEPRECATE_TEXT(_Text) __declspec(deprecated(_Text)) [composable_kernel] | ^ [composable_kernel] clang++: error: clang frontend command failed due to signal (use -v to see invocation) [composable_kernel] AMD clang version 22.0.0git (https://github.com/ROCm/llvm-project.git a2dc42b87c63e686377a69f09ea23aec7550babc+PATCHED:e4d5bf498b7b8626bb9716f1f5a5946d45025918) [composable_kernel] Target: x86_64-pc-windows-msvc [composable_kernel] Thread model: posix [composable_kernel] InstalledDir: E:\TheRock\build\core\clr\dist\lib\llvm\bin [composable_kernel] clang++: note: diagnostic msg: Error generating preprocessed source(s). [composable_kernel] ninja: build stopped: subcommand failed. 
[composable_kernel FAILED WITH CODE 1 in 238 seconds]
ninja: build stopped: subcommand failed.
```
## Test Plan
Wait for internal CI and make sure build compiles locally.
## Test Result
Waiting on CI
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
25560c26fb |
[CK] Fix gptoss sink (#4313)
## Motivation
This PR removes conditional logic for handling infinity values in the sink mechanism across multiple FMHA pipeline implementations, defaulting sink_size to 0 and adding a constraint in the kernel selection logic.
## Technical Details
Changes:
- Removed `__builtin_isinf_sign(sink_v)` checks and conditional initialization of LSE accumulators across 7 pipeline files
- Added default initialization (`= 0`) for `sink_size` in 4 argument structs
- Added `F_sink == "f"` constraint to kernel compatibility checking
## Test Plan
Local test
## Test Result
passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
|
||
|
|
7bd29c3845 |
[CK] Use as_posix() instead of str() for paths in fmha_fwd_appendkv.py (#4812)
## Motivation
This is causing a failing PR for Windows: https://github.com/ROCm/TheRock/pull/3382
```
[composable_kernel configure] -- Jenga kernel files to be generated: B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_api.cpp
[composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library):
[composable_kernel configure]   Syntax error in cmake code when parsing string
[composable_kernel configure]
[composable_kernel configure]     B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp
[composable_kernel configure]
[composable_kernel configure]   Invalid character escape '\b'.
```
## Technical Details
The file [fmha_fwd_appendkv.py](https://github.com/ROCm/rocm-libraries/compare/users/brockhargreaves-amd/ck/fix-windows-cmake-path-problem?expand=1#diff-bef22bf9ba21eb93c725493ecc7edcb6f2a8f0a9a173dcfca6bda7a9f4eced78) writes a list of paths to a text file that is later parsed by CMake. When a `pathlib.Path` is passed to `str()`, the result uses the native separator, so on Windows `/` becomes `\`, which CMake then reads as the start of an escape sequence. The paths therefore need to be written with forward slashes before they are passed on to CMake.
## Test Plan
1. Ensure this doesn't impact existing CI.
2. Ensure the Windows build passes locally.
## Test Result
1. Passes existing CI.
2. This fixes the compilation error locally.
## Submission Checklist
- [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
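The separator problem described in this commit message can be reproduced on any host with `pathlib.PureWindowsPath`. The sketch below is only an illustration of the idea, not the actual change made to the codegen script; the helper name `to_cmake_path` and the shortened example path are hypothetical.

```python
from pathlib import PureWindowsPath


def to_cmake_path(path):
    """Return the path with forward slashes (hypothetical helper).

    CMake accepts forward slashes on every platform, while a backslash
    inside a CMake string starts an escape sequence such as '\\b'.
    """
    return path.as_posix()


# PureWindowsPath mimics Windows path semantics regardless of the host OS.
p = PureWindowsPath("B:/build/example/fmha_jenga_fwd_api.cpp")

native = str(p)              # uses backslashes, as str(Path) does on Windows
portable = to_cmake_path(p)  # uses forward slashes

print(native)
print(portable)
```

Writing `portable` rather than `native` into the generated file list keeps backslashes out of the text that CMake parses.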
||
|
|
ce6acc5f66 |
[CK_TILE][FMHA] Support gfx11 (#4584)
## Motivation
Add support for gfx11 architectures (RDNA3) to FMHA.
## Technical Details
Distributions (the mapping of matrix elements to lane registers) of gfx11 WMMA are completely different from the distributions of gfx9 MFMA and gfx12 WMMA. There are two cases in FMHA where this difference matters:
* usage of results (matrix C) of one GEMM as input (matrix A) of another GEMM.
* random number generation for dropout (the implementations for gfx9 MFMA, gfx12 WMMA and host validation produce the same results).
Both cases are solved by a special remapping implemented using `__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`.
Additional changes:
* FMHA tests are now built and run only for those types for which instances exist (gfx11 supports only fp16 and bf16).
* Two fixes for uninitialized values (`mask.sink` and `do_fp8_static_quant`): they may contain garbage, resulting in incorrect dispatching logic; sometimes tests report that there are no instances available for the current parameters.
* Small fix to remove expcnt(0) from the s_waitcnt instruction on gfx11 when it is not requested (i.e. every time); this likely has no effect on performance but makes the disassembly a bit clearer.
## Test Plan
```
ninja test_ck_tile_fmha
bin/test_ck_tile_fmha_fwd_fp16
bin/test_ck_tile_fmha_fwd_bf16
bin/test_ck_tile_fmha_bwd_fp16
bin/test_ck_tile_fmha_bwd_bf16
```
## Test Result
All tests must pass (some tests may be skipped).
## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
fdb1a08e6f |
Enable group mode (varlen) kernel generation for PyTorch integration (#4292)
## Proposed changes
This PR enables group mode (variable-length attention) kernel generation for PyTorch's CK SDPA backend.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
- [X] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant after this pull request
- [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
- [X] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
The change is minimal (single line deletion) but enables a significant feature: variable-length attention support for ROCm users via PyTorch's torch.nn.attention.varlen API.
---
🔁 Imported from [ROCm/composable_kernel#3553](https://github.com/ROCm/composable_kernel/pull/3553)
🧑💻 Originally authored by @chinmaydk99
Co-authored-by: Chinmay_Kuchinad <ChinmayDattanand.Kuchinad@amd.com> |
||
|
|
9c0d4114ae |
[CK] Add FP8 KV_BLOCKSCALE support for batch prefill (#4263)
Implement per-page K/V quantization for paged attention:
- Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
- Use exp2 shift trick to eliminate explicit P scaling overhead
- Prefetch physical pages offset for KV cache, overlaps with computations
## Proposed changes
Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant after this pull request
- [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered
---
🔁 Imported from [ROCm/composable_kernel#3696](https://github.com/ROCm/composable_kernel/pull/3696)
🧑💻 Originally authored by @Jeff-Huang
---------
Co-authored-by: Jeff Huang <chiachi.huang@amd.com>
Co-authored-by: Illia Silin <Illia.Silin@amd.com> |
||
|
|
e38029e946 |
Extend CK fmha_batch_prefill kernel coverage to head_dim=256 (#3328)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit:
|
||
|
|
90b3476006 |
Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)" (#3635)
This reverts commit 723b7ce0be2884da131036301892bf9157f51876.
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit:
|
||
|
|
4ded7e5984 |
Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)
This reverts commit ceccf15275645cc64db0a4ae53f5a215c93a7969.
[ROCm/composable_kernel commit:
|
||
|
|
f6fac4cea6 |
[CK_TILE][FMHA]Add new tile size for async (#3623)
* Revert "Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)"
This reverts commit cfdad49edda4b2ccef92571f23646a8505bb2859.
* Add new tile_size for async pipeline
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
[ROCm/composable_kernel commit:
|
||
|
|
14254656f0 |
Fp8 block scale quantization for fmha fwd (#3330)
* add block scale parameters to kernel
* add block scale to kernel
* add smoke test
* format
* Revert "format"
This reverts commit |
||
|
|
e227e837be |
Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)
This reverts commit 217ac48fd83deef3d0d5084815689e8c79958cc1.
[ROCm/composable_kernel commit:
|
||
|
|
ecda0fe2e9 |
[CK_TILE][FMHA] Add new tile size for async (#3586)
* add new tile size for async
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix lse error
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
---------
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
[ROCm/composable_kernel commit:
|
||
|
|
445ec888ba |
[FMHA] Enable page size 16 for batch prefill kernel (#3568)
* [FMHA] Enable page size 16 for batch prefill kernel
* Refactor batch prefill KV offset logic to simplify template arguments
- Remove redundant `kLog2PageSize` and `kIsVTileFitsInPage` from template args.
- Add static assert to forbid `page_size=1` with vectorized layout.
[ROCm/composable_kernel commit:
|
||
|
|
99b88be5fb |
[FMHA] Support page_size=1 (linear layout) in batch prefill pipeline (#3545)
- Enable page_size=1 support in batch prefill codegen (linear layout only).
- Implement per-token page lookup in `kv_offset_array_transform` for page_size=1 to handle 3D input tensors correctly.
- Relax `kPageBlockSize` alignment assertion for the page_size=1 case.
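The per-token page lookup mentioned above can be pictured with a small host-side model. This is a hedged sketch of the general paged-KV addressing idea, not the code of `kv_offset_array_transform`; the function name `kv_token_offset` and its parameters are assumptions made for illustration.

```python
def kv_token_offset(token_idx, page_table, page_size):
    """Map a logical token index to a physical row of a paged KV cache.

    page_table[i] is assumed to hold the physical page id of logical page i.
    """
    logical_page, offset_in_page = divmod(token_idx, page_size)
    return page_table[logical_page] * page_size + offset_in_page


# With page_size=1 every token is its own page, so the lookup degenerates
# to one page-table read per token.
per_token = [kv_token_offset(t, [7, 2, 5], page_size=1) for t in range(3)]

# With a larger page, consecutive tokens share one page-table entry.
paged = [kv_token_offset(t, [3, 0], page_size=16) for t in (5, 17)]

print(per_token, paged)
```

Under this model, page_size=1 means no two consecutive tokens are guaranteed to be adjacent in memory, which is consistent with the commit restricting that case to the linear layout.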
[ROCm/composable_kernel commit:
|
||
|
|
fd84daec4c |
[FMHA] Batch Prefill Support Improvements: Change KV Cache Layout & Large Page Size Support (#3442)
* add page_block_size parameter
* add is_sglang_layout to parameters
* add kv_offset_array_transform to batch async for page size 16
* add kv_last_page_lens to kernel
* change kv layout to [num_total_pages, page_block_size, hdim]
* format
* - enable codegen of batch_prefill kernels
- create new problem struct BlockFmhaBatchPrefillPipelineProblem for
batch prefill kernels
- generate different page sizes of batch prefill kernels (1, 16)
* 1. fix wrong calculation of page id in kv_offset_array_transform in gfx950
2. support page size 1024
* fix python format
* change kv cache layout to [num_blocks, num_kv_heads, head_size/x,
block_size, x] and [num_blocks, num_kv_heads, block_size/X, head_size, X]
* 1. Introduce `kVectorSize` in BlockFmhaBatchPrefillPipelineProblem instead of using hard-coded values
2. Make the batch prefill kernel traits structures inherit from the fmha fwd
traits
3. Add static checks for page size, vector size, hdim, etc.
* [Refactor] Replace is_sglang_layout with Enums for KV cache configuration
Refactored `fmha_batch_prefill` to use `BlockAttentionKVCacheMemoryLayoutEnum` (VECTORIZED/LINEAR) and `BlockAttentionKVCacheLookupTableEnum` (SGLANG_1D/VLLM_2D) instead of a single
boolean.
**Changes:**
* Added Enum definitions in `block_attention_kvcache_layout_enum.hpp`.
* Updated Kernel, Pipeline, and Traits to template on these Enums.
* Implemented `kv_offset_array_transform` logic based on `kKVMemoryLayout`.
* Refactored `PageBlockTableKargs` to adapt to `kKVLookupTable`.
* Updated CodeGen scripts to support new parameters.
This decouples memory layout from the paging mechanism, enabling flexible KV cache configurations.
* 1. remove batch prefill pipeline with sk_pad=false
2. correct some comments
3. add static assert to make sure V offsets are in the same page within a tile.
* fix vgpr spill count
* remove unnecessary t2s functions
* add fp8 support for receipt 200 and 600 in fmha_batch_prefill.py
* support linear kv cache layout
* Remove block_table_ptr from fwd_batch_prefill_args. Instead, reuse
kv_page_indices as a pointer of the lookup table.
* 1. merge multiple transforms into single transform.
2. add static check to make sure vlayout is row-major.
* move FmhaFwdCommonKargs::seqlen_k_ptr to VllmPageTableKargs.
* update changelog
---------
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
[ROCm/composable_kernel commit:
|
||
|
|
a2402950de |
[CK_TILE][FMHA] Add FP8 support for batch_prefill kernel (#3425)
* Add fp8bf16 support for batch_prefill
* Fix wrong scale_s re-compute logic in batch_prefill
* Fix wrong scale_s re-compute logic in fmha fwd
* Fix batch_prefill codegen error
* Remove no-longer used GetName() function
* Add fp8 logits=True instances
* Update CHANGELOG.md
[ROCm/composable_kernel commit:
|
||
|
|
97556d24f2 |
[CK_TILE][FMHA] Add logits soft-capping support for FAv3 (WIP) (#3355)
* Make fmha_fwd_v3() compatible with fmha_fwd()
* Decouple get_fwd_blobs() and FmhaFwdKernel
* Decouple compatibility checks from get_fwd_blobs()
* Extract product feature checks out from get_fwd_blobs()
* Remove duplicated code in factories and redundant checks
* Remove FmhaFwdKernel<>::GetName()
* Let FmhaFwdApiPool support pipelines with different mask_impl
* Add tile setting for fmha fwd v3 pipeline
* Add fwd v3 instances to tile_example_fmha_fwd manually
* Remove unused function import
* Undo irrelevant changes
* Remove fwd v3 instances from tile_example_fmha_fwd
* Finish fmha fwd v3 kernel instance codegen
* Fix formatting
* Remove unused F_idx attribute
* Add is_generic_attention_mask<> traits
* Add constraints to the fmha fwd v3 pipeline
* Unify traits & problem used for fmha fwd v3
* Unify kernel launch code for fmha fwd v2 & v3
* Unify kernel template selection logic
* Use same kernel codegen template for both v2 & v3
* Rename api() property as render() method
* Allow specifying filter for fmha fwd api pool
* Allow specifying function name when rendering api pool items
* Separate fmha fwd v3 kernel dispatching logic from v2
* Remove lambda assignment
* Add simple v2/v3 dispatch logic
* Stop generating empty if-clauses
Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them.
* Use "".join() to concatenate fmha fwd api string content
* Add more feature checks for fmha fwd v3 pipeline
* Check features before dispatch to fmha_fwd_v3()
* Add more feature checks for fmha_fwd_v3()
* Add missing filter call
* Use Tuple to preserve the dtype order
* Fix wrong pipeline matching logic
* Add fmha fwd v3 group mode instances
* Add functor_transform<>
* Add type constraints to make_tile_window()
* Remove fmha fwd v3 example
* Fix wrong product(aiter mha_fwd()) config
* Fix wrong fmha fwd v2/v3 selection logic
* Fix formatting
* Add comment to warn v3 kernel users
* Fix wrong codegen logic
* Remove unnecessary param
* Fix format
* Add logits soft-capping support for fmha fwd v3 pipeline (WIP)
* Add missing Kargs base type
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit:
|
||
|
|
c8397e8ef2 |
flashattention fwd add (80, 96) instance (#3415)
* add hdim (96,96) instance
* change to (80,96)
* format py
* remove 96 in optdim
* when N=6 change to llvm_amdgcn_raw_buffer_load_i32x3
[ROCm/composable_kernel commit:
|
||
|
|
97b2015929 |
Fix FMHA fp8 hdim=64 incorrect result in MI200 (#3423)
* Fix incorrect result in hdim=64
* Add change log
[ROCm/composable_kernel commit:
|
||
|
|
51886bf22b |
Add attention sink support for FMHA FWD (#3368)
* Revert "Revert "Add attn sink (#2892)" (#3250)"
This reverts commit e3be392d13e6ee107d823af32aca2d3ff03ca69d.
* fix conflict
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* Add F_sink parameter to FmhaFwdPipeline
* Update tile_fmha_traits.hpp
* Refactor pipeline creation in fmha_fwd.py
Updated the pipeline creation logic to include 'sink' parameter in product combinations and adjusted the FmhaFwdPipeline calls accordingly.
* Update fmha_fwd.py
* Update fmha_fwd.py
* Update example/ck_tile/01_fmha/script/correct_test_fwd_sink.sh
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* update CHANGELOG.md
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* Update CHANGELOG with new features and support
* Update fmha_fwd.hpp
* Update CHANGELOG.md
* Update smoke_test_fwd_sink.sh
* Update correct_test_fwd_sink.sh
* Update smoke_test_fwd_sink.sh
---------
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
[ROCm/composable_kernel commit:
|
||
|
|
737c80d47d |
fix: python 3.8 compatibility in fmha codegen (#3388)
[ROCm/composable_kernel commit:
|
||
|
|
d96f632fa1 |
[CK_TILE][FMHA] Integrate FAv2 & FAv3 (WIP) in the single fmha_fwd() API (#3153)
* Make fmha_fwd_v3() compatible with fmha_fwd()
* Decouple get_fwd_blobs() and FmhaFwdKernel
* Decouple compatibility checks from get_fwd_blobs()
* Extract product feature checks out from get_fwd_blobs()
* Remove duplicated code in factories and redundant checks
* Remove FmhaFwdKernel<>::GetName()
* Let FmhaFwdApiPool support pipelines with different mask_impl
* Add tile setting for fmha fwd v3 pipeline
* Add fwd v3 instances to tile_example_fmha_fwd manually
* Remove unused function import
* Undo irrelevant changes
* Remove fwd v3 instances from tile_example_fmha_fwd
* Finish fmha fwd v3 kernel instance codegen
* Fix formatting
* Remove unused F_idx attribute
* Add is_generic_attention_mask<> traits
* Add constraints to the fmha fwd v3 pipeline
* Unify traits & problem used for fmha fwd v3
* Unify kernel launch code for fmha fwd v2 & v3
* Unify kernel template selection logic
* Use same kernel codegen template for both v2 & v3
* Rename api() property as render() method
* Allow specifying filter for fmha fwd api pool
* Allow specifying function name when rendering api pool items
* Separate fmha fwd v3 kernel dispatching logic from v2
* Remove lambda assignment
* Add simple v2/v3 dispatch logic
* Stop generating empty if-clauses
Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them.
* Use "".join() to concatenate fmha fwd api string content
* Add more feature checks for fmha fwd v3 pipeline
* Check features before dispatch to fmha_fwd_v3()
* Add more feature checks for fmha_fwd_v3()
* Add missing filter call
* Use Tuple to preserve the dtype order
* Fix wrong pipeline matching logic
* Add fmha fwd v3 group mode instances
* Add functor_transform<>
* Add type constraints to make_tile_window()
* Remove fmha fwd v3 example
* Fix wrong product(aiter mha_fwd()) config
* Fix wrong fmha fwd v2/v3 selection logic
* Fix formatting
* Add comment to warn v3 kernel users
* Fix wrong codegen logic
* Remove unnecessary param
* Fix format
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit:
|
||
|
|
228b1e8d87 |
fp8 fmha async pipeline (#3339)
* replace qr with async pipeline
* Add fp8fp32 to DTYPE_BITS
* Add kAlignmentRandVal to avoid compile fail
* format
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
[ROCm/composable_kernel commit:
|
||
|
|
a535de0f75 |
chore(copyright): update copyright header for example directory (#3273)
* chore(copyright): update copyright header for codegen directory
* chore(copyright): update copyright header for example directory
[ROCm/composable_kernel commit:
|
||
|
|
f20f9dd453 |
Fix batch prefill compile fail in aiter (#3279)
* Fix batch prefill aiter compile fail
* Fix compile error
[ROCm/composable_kernel commit:
|
||
|
|
cdd72e57d3 |
Support fp8 dynamic quantization for fmha (#3206)
* Support qscale for dynamic quant, remove static quant
* Support hdim=256
* Remove bias test case for fp8
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
[ROCm/composable_kernel commit:
|
||
|
|
06d2e609cd |
Revert "Add attn sink (#2892)" (#3250)
This reverts commit bbe1d3a917ee92655224c0f1528ace3a7b0e82a8.
[ROCm/composable_kernel commit:
|
||
|
|
f4ba63deb7 |
Add attn sink (#2892)
* enable attn sink
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* update attn_sink script
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* fix some error
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* clang-format
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* update fmha_bwd mask
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* update fmha_bwd_kernel's mask
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* update block_fmha_pipeline_qr_ks_vs.hpp
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* fix ci error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* fix format error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update block_fmha_bwd_pipeline_default_policy.hpp
* Update fmha_fwd_runner.hpp
* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp
* Update fmha_fwd_runner.hpp
* Update fmha_fwd_runner.hpp
* Update fmha_fwd_runner.hpp
* update splitkv_pipeline
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update splitkv&pagedkv pipeline
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* add sink test
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update attn_sink result log
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update smoke_test_fwd_sink.sh
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update test file
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update test script
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp
* use constexpr kHasSink for sink in fmha pipeline
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* update by pre-commit
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update fmha_fwd.py
* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Remove causal mask setting logic from mask.hpp
Removed the mask setting logic for causal masks.
* fix CI error: some lambda usages are not supported in C++17
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update remod.py
* add smoke sink test
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update fmha_pagedkv_prefill.py
* Update FmhaFwdPipeline parameters in fmha_fwd.py
* update block_fmha_pipeline_qr_ks_vs_async_trload.hpp
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* fix C++17 unsupported-feature error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp
* Fix formatting of sink_seq_end assignment
* Fix indentation for sink_seq_end assignment
* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp
---------
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
[ROCm/composable_kernel commit:
|
||
|
|
220bd7a9bb |
[CK_TILE] Support WMMA (gfx12) in FMHA (#2528)
* Pass hdim to tile_example_fmha_fwd in fp8 tests
* Add WMMA support to fwd FMHA pipelines
* Tune tile sizes a bit for less spilling
fp16 256 is still quite slow
* Fix Q grad tile distribution for warp size = 32 and hdim >= 256
With AccDataType = float and warp size = 32, K0 becomes 0, so K repeat is required to correctly distribute the tile.
* Use code based on BlockDropout in BlockDropoutBwd
* Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal
* Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped; it worked on gfx9
because they are both equal to 8, while on gfx12 they are 8 and 4;
* Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1];
it worked on gfx9 because 4 * kM == kN1 == 32;
* Removing these hidden dependencies allows supporting:
* any number of warps (power-of-2), not only 4;
* kN1 = 16, not only 32;
* any number of splits;
* Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now
* Replace hard-coded kN1 in dispatch code with the requested tile size
* Add gfx12-specific tile sizes for split KV
* Pass GPU architecture to kernel generation scripts
This is still a temporary solution.
* Build and run FMHA CI tests for gfx12
* Fix issue after merging
* Fix bwd tile sizes
The current pipelines always read only one K tile and one V tile, which
requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and
kK2 == kVHeaddim).
* Use hardware f32->f8 on gfx12, remove v_perm
__builtin_amdgcn_perm is not needed because
__builtin_amdgcn_cvt_pk_fp8_f32 allows specifying which word (16 bits of a
32-bit dword) is used to store the results (two f8 values).
* Update changelog
* Add WMMA support to pagedkv
* Fix scripts after rebasing
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler to produce more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Fix names after cherry-picking
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Do not use filters related to qr_async_trload
They disable tiles/pipelines which are valid for gfx12.
* Use different dstr encoding when C is transposed
* Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code
Some WarpGemmDispatcher instantiations are defined only
for specific archs and undefined on host.
Calculations related to sched barriers are moved from Pipeline's public
fields into pipeline's operator().
* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution
Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution
because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used
in codegen scripts are 32, 32, 32.
* Generalize usages of WarpGemmDispatcher for MFMA and WMMA
WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still
used explicitly because of swizzle factor = 4.
* Mark has_load_tr as maybe_unused
There is no transpose loading on RDNA.
* Remove CK_TILE_USE_MFMA/WMMA from fmha-related code
* Detect BlockSize on host based on warp size of the current device
If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with
kBlockSize / 2 because on host get_warp_size() == 64 always.
* Fix calculation of grid size for combine kernel with warp size = 32
* Add missing includes and header
* Support multiple archs in one binary for fwd
* Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill
* Support multiple archs in one binary for bwd
* trload kernels are compiled only for gfx950;
* instances with padding are checked after instances without padding so
they can be used as fallbacks (similarly to fwd);
* Extract common code from register_traits
* Revert "Fix regression with philox seed and offset when they exceed 32-bit int"
To simplify merging; the proper fix is already in develop.
* Support new numerical d paddings in trait ordering checks
* Build fp32 tests only on gfx9
* Do not use hardcoded M0 = 64 for dot bwd kernel
* Use textwrap.indent from standard library
* Make fp8 pipelines on gfx12 consistent with gfx9
* Update tests for current pipelines
* Make ninja check more responsive in CI
ninja buffers output, so the job appears to hang.
* Support fp8fp32 by limiting O vector size
The fp32 output type requires storing 8 * sizeof(float) = 32 bytes,
which is not implemented (here 8 is the number of C values per lane for
v_wmma_f32_16x16x16...).
* Remove unused cmake options
* Unify including amd_buffer_addressing.hpp/_builtins.hpp
* Temporarily use amd_buffer_addressing.hpp on >=gfx10
amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores
which is not compatible with >=gfx10:
* 1 scalar for exec masks instead of 2,
* gfx12 uses different instruction names etc.
* Update asm in bf16 conversions to work with warp 32
* Do not generate splitkv/appendkv with vlayout=col for consistency with fwd
* Add arch tags to kernels/host funcs, compile for each arch separately
* Add kM0 to fmha_bwd_dot_do_o kernel name to match filename
* Add workaround for miscompilation of bwd with padded hdim
SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent
branches used to store padded tensors (when some lanes are inactive due
to padding). Inline asm with dummy dependencies on VGPRs of the tensors
prevents the compiler doing this.
* Fix add_gtest_executable for absolute paths
Some tests (like gemm_tile_engine) pass absolute paths to source files.
In CI the branch name is a part of the root dir, and if the branch name
contains "wmma", "xdl" etc., files can be incorrectly excluded.
* Run only hdim 128 smoke tests for fp8fp32
There are no instances for hdim 64 and 256.
* Format py with ruff to simplify merging develop
* Fix incorrect var name
* Codegen for gfx9,gfx950 when --targets is not specified
Aiter and Pytorch require changes for passing their targets to the codegen scripts.
With this temporary solution the files are generated but not all of them
have to be really built (depending on the used --offload-arch=).
* Combine arch-related values into ArchTrait
This more centralized approach removes duplication of various formatting templates.
* Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large"
Some code is extracted into a function.
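One item in the list above notes that `__builtin_amdgcn_readfirstlane` works with 32-bit values, so a 64-bit Philox seed or offset passed through it gets truncated. The arithmetic can be modeled in plain Python as below. This is only an illustration of the truncation and of handling the two 32-bit halves separately, which is an assumed remedy and not necessarily the fix that landed in develop; all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def truncate_to_32(value):
    """Model what happens when a 64-bit value goes through a 32-bit path."""
    return value & MASK32


def split_64(value):
    """Split a 64-bit value into (low, high) 32-bit halves."""
    return value & MASK32, (value >> 32) & MASK32


def join_64(lo, hi):
    """Rebuild the 64-bit value from its two 32-bit halves."""
    return (hi << 32) | lo


seed = (1 << 40) + 123          # does not fit in 32 bits
lost = truncate_to_32(seed)     # high bits are dropped
kept = join_64(*split_64(seed))  # both halves survive

print(lost, kept == seed)
```

Seeds below 2**32 are unaffected by the truncation, which is why a regression of this kind only shows up once the seed or offset grows large.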
[ROCm/composable_kernel commit:
|