mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-14 10:09:41 +00:00
ce838e19e5ea1931041537112d85185bb7ce7fa7
535 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
ce838e19e5 |
[CK_TILE] FMHA BWD launcher: address PR #7331 review comments (round 2)
- prepare_workspace_async: allocate pinned host staging before enqueuing the dq_acc memset. If pinned_host_alloc throws, no stream work has been issued yet, so the workspace is left cleanly un-prepared rather than half-initialized. - pack_workspace_host catch: note that the H2D queued after the callback will copy indeterminate metadata if the catch fires (kernel will produce wrong results); unlikely since pack only throws on precondition violations. - schedule_pin_staging_release: std::move pin_staging_ into the heap shared_ptr; the next line in prepare_workspace_async overwrites it, so the extra atomic inc/dec from a copy is wasted. |
||
|
|
422b6d6c16 |
[CK_TILE] Fix FMHA BWD workspace upper-bound undersizing in group mode
GetWorkspaceDeviceSizeUpperBound was computing
max_batch * nhead_q * max_seqlen_q * hdim_q
in non-deterministic group mode, but PrepareWorkspaceHost actually returns
nhead_q * seqstart_q[batch] * hdim_q
i.e. it scales with the sum of *padded* per-batch seqlen_q, not max_batch
times the *logical* max. When per-batch padding makes seqstart_q[batch]
exceed max_batch * max_seqlen_q the launcher under-allocates dq_acc, the
kernel writes past the buffer, and tests see either ~42% wrong QGrad
values or a GPU page fault (e.g. test_ck_tile_fmha_bwd_bf16
QKVPadding/23,24,26 corrupt; /27 page-faults).
Fix: replace the (max_batch, max_seqlen_q) pair with a single
total_seqlen_q_padded parameter holding the true total padded q tokens.
Launcher derives it from the trait (group: t.seqlen_q already is the
padded total; batch: t.batch * t.seqlen_q). The four mode formulas
collapse to one:
size = nhead_q * nsplits_factor * total_seqlen_q_padded * hdim_q
where nsplits_factor is 1 for non-deterministic, ceil(max_seqlen_k, kN0)
for deterministic group, and the persistent worker computation for
deterministic non-group (the only branch that still needs max_batch).
No caller-side API change: FA, AITER and the CK runner already pass
q.shape[0] (the padded total) as traits.seqlen_q in group mode.
Verified on gfx1201: full test_ck_tile_fmha_bwd_{bf16,fp16} 672/672 PASS,
0 fail, 0 crash (was 27/28 QKVPadding fails + 1 GPU illegal access).
|
||
|
|
1f4cc34e68 | [CK_TILE] FMHA BWD launcher: address PR #7331 review comments | ||
|
|
d434410e52 | [CK_TILE] Add async workspace prepare to FMHA BWD launcher | ||
|
|
7cc0abc9ba |
[CK] Fix latest batch of staging compiler warnings (#7111)
## Motivation Suppress the new batch of clang lifetimebound and invalidation warnings with the latest staging compiler. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
ed0912fc0a |
[CK_TILE] Enable V3 persistent kernel dispatch for FMHA forward on gfx950 (#6529)
[CK_TILE] Enable V3 persistent kernel dispatch for FMHA forward on
gfx950
## Motivation
Enable the existing V3 persistent kernel path for CK-Tile FMHA forward
on
gfx950 (MI350X/MI355X). The V3 kernel and codegen infrastructure already
exist but are disabled via hardcoded `F_is_v3_enabled=False`.
This change replaces the compile-time gate with a runtime environment
variable
`CK_FMHA_ENABLE_V3=1` (disabled by default, opt-in). When enabled:
- **Prefill** workloads (seqlen_q > 1) dispatch to V3 persistent
pipeline
- **Decode** workloads (seqlen_q == 1) always use V2 (memory-bound,
better suited)
The V3 persistent kernel uses grid-stride scheduling, XCD-interleave
tile
assignment for L2 locality, LPT reversal for causal masks, and gfx950
async
buffer loads.
## Technical Details
Single file: `example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py`
- Add `#include <cstdlib>` and `<string>` for `std::getenv`
- Replace `{F_is_v3_enabled}` template parameter with runtime env var
check
- Add `seqlen_q > 1` guard (decode always uses V2)
- Remove `.format()` call in `write_fwd_api()`
## Dependencies
Depends on https://github.com/ROCm/rocm-libraries/pull/6501 — builds on
XCD-interleave and LPT scheduling infrastructure.
## Test Plan
- GPU validation on MI300X (gfx942, ROCm 6.4.1):
- Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128
-prec=bf16 -v=1 -warmup=1 -repeat=3`
- GPU validation on MI350X (gfx950, ROCm 7.0):
- Command (V2): `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096
-d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3`
- Command (V3): `CK_FMHA_ENABLE_V3=1 ./build/bin/tile_example_fmha_fwd
-b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3`
- Command (decode, always V2): `./build/bin/tile_example_fmha_fwd -b=64
-h=32 -h_k=8 -s=1 -s_k=4096 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1
-repeat=3`
## Test Result
Benchmark results (MI350X, gfx950, ROCm 7.0):
| Config | V2 (TFlops) | V3 (TFlops) | Speedup |
|--------|-------------|-------------|---------|
| Non-causal b=2 h=8 hk=2 s=4096 d=128 bf16 | 696.3 | 884.2 | **+27.0%**
|
| Causal b=2 h=8 hk=2 s=4096 d=128 bf16 | 371.3 | 494.9 | **+33.3%** |
| GQA b=2 h=32 hk=8 s=2048 d=128 bf16 | 671.3 | 831.7 | **+23.9%** |
| LLaMA-70B b=1 h=64 hk=8 s=4096 d=128 bf16 | 761.5 | 927.3 | **+21.8%**
|
| Causal GQA b=2 h=32 hk=8 s=2048 d=128 bf16 | 345.4 | 631.9 |
**+82.9%** |
| Long-seq b=1 h=16 s=16384 d=128 bf16 | 797.8 | 969.9 | **+21.6%** |
| Decode b=64 h=32 hk=8 s=1 s_k=4096 bf16 | 1828 GB/s | — (V2 path) |
unaffected |
Benchmark results (MI300X, gfx942, ROCm 6.4.1):
V3 has 0% effect on MI300X — V3 relies on gfx950 async buffer loads and
falls back to the V2 code path on gfx942. No regression on any config.
| Config | TFlops / GB/s | Time (ms) | Delta vs baseline |
|--------|-------------|-----------|-------------------|
| MHA bf16 b=2 h=8 s=4096 d=128 | 342.98 TFlops | 0.401 | +0.1% |
| MHA fp16 b=2 h=8 s=4096 d=128 | 411.18 TFlops | 0.334 | +4.9% |
| Causal MHA bf16 b=2 h=8 s=4096 d=128 | 232.61 TFlops | 0.296 | +2.4% |
| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 | 320.07 TFlops | 0.429 |
-1.4% |
| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 | 353.91 TFlops | 0.777 |
+1.7% |
| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 | 381.53 TFlops |
1.441 | +1.2% |
| Long-seq bf16 b=1 h=16 s=16384 d=128 | 388.61 TFlops | 5.659 | +1.4% |
| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 | 693.40 GB/s | 1.550 |
+0.3% |
All validation tests pass (`valid:y`) on both MI300X and MI350X.
Additional validation:
- `CK_FMHA_ENABLE_V3=0` correctly falls back to V2 (default behavior
unchanged)
- `CK_FMHA_ENABLE_V3=1` dispatches to V3 for prefill, V2 for decode
- Validation passes across fp16/bf16, batch/group mode,
causal/non-causal
- No regression on decode path
---------
Co-authored-by: Chao Zhou <chaozhou@fb.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
|
||
|
|
1cf336d87a |
[CK_TILE][FMHA] Fix sink un-mask under right-window and emit fp8bf16 batch_prefill sink kernels (#6914)
## Summary
Two related fixes to `ck_tile` FMHA so that StreamLLM-sink +
sliding-window
batch-prefill works correctly for fp8 KV / bf16 compute.
Review the commits in this order:
1. `fmha: emit sink kernels for fp8bf16 batch_prefill`
Extends `example/ck_tile/01_fmha/codegen/ops/fmha_batch_prefill.py` so
the fp8(KV) / bf16(QO) batch-prefill codegen also emits the
`mask=mask_enum::generic_with_sink` variant. Without this the runtime
could not dispatch to a sink-aware kernel for the fp8bf16 path.
2. `fmha: respect right-window in IsOutOfSinkBound`
The sink un-mask in `GenericAttentionMask::IsOutOfSinkBound` (local-mask
branch) used `(i_y + x) > 1` as the gate, which conditioned on the row
index instead of the column index. As a result, queries `1..sink-1`
could attend to *future* sink positions (violating causal /
right-window),
while query `0` fell back to the plain causal mask. The fix replaces the
guard with `i_x < i_y + x` so every query only sees sink columns up to
its own right-window boundary.
3. `fmha: clarify IsOutOfSinkBound predicate comment`
Doc-only follow-up that rewrites the comment above the predicate as a
clause-by-clause explanation (`i_x < sink`, `i_x < i_y + x`,
`y < y_total`, `i_y < x_total`).
## Test plan
- [x] Repro on aiter `op_tests/test_batch_prefill.py` (fp8 +
bf16_dequant
modes with `sink=4`, `win_left=1023`, `softcap=0.0`, `sal=True`)
now passes for all parametrized shapes.
- [x] Existing fp16/bf16 batch-prefill paths (no sink) unchanged —
codegen
diff only adds the `generic_with_sink` variant for fp8bf16; existing
kernel object lists unaffected.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: fengjunda.aml <fengjunda.aml@bytedance.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: root <root@smci350-rck-g03-f12-31.rck.dcgpu>
|
||
|
|
0398b864c3 |
[CK_TILE] Use Unified Workspace for FMHA BWD (#6152)
## Motivation
`dq_acc` is the intermediate accumulation buffer used in FMHA backward
pass for deterministic mode. The current implementation allocates it as
a **single rectangular tensor**:
```
shape = [shape_batch, nhead, nsplits, shape_seqlen_q, hdim_q]
```
where `nsplits = launcher.dq_acc_splits` (a single scalar), computed
from `max_seqlen_k` and shared across all batches.
### Problems
1. **Memory waste**: In group mode, each batch may have a different
`seqlen_k`, but `nsplits` is computed from `max_seqlen_k`, causing
batches with shorter `seqlen_k` to over-allocate in the split dimension.
2. **Interface coupling**: `fmha_bwd_args` exposes internal layout
details such as `stride_dq_acc`, `nhead_stride_dq_acc`,
`batch_stride_dq_acc`, and `split_stride_dq_acc`. The caller is
responsible for computing these strides, but this logic belongs inside
the kernel.
### Goals
1. Switch `dq_acc` buffer to a **compact layout**: batches are
concatenated contiguously, with each batch occupying `nhead * nsplits_i
* seqq_i * hdim_q` elements (nhead outermost).
2. **Remove all `*_stride_dq_acc` fields** from `fmha_bwd_args`,
replacing them with a single `workspace_ptr`; the kernel splits this
internally using a fixed layout.
4. `fmha_bwd_launcher` provides a **workspace management interface**:
the caller only needs to allocate GPU memory and call
`prepare_workspace()` — no layout computation required.
5. **Isolate kernel internals from the caller API**: the `dq_acc` layout
(nsplits, strides, buffer size) is determined entirely inside the
launcher/kernel. Future changes to block shape, pipeline type, or
persistent kernel strategy require no modifications to the caller's
`fmha_bwd_args` or workspace allocation logic.
## Technical Details
### Interface Design
#### New fields in `fmha_bwd_traits`
```cpp
struct fmha_bwd_traits
{
int seqlen_q;
int seqlen_k;
int batch;
int max_seqlen_q;
int max_seqlen_k;
int hdim_q;
int hdim_v;
int nhead_q;
int nhead_k;
std::string data_type;
bool is_group_mode;
mask_enum mask_type;
bias_enum bias_type;
bool has_dbias;
bool has_dropout;
bool is_store_randval;
bool is_deterministic;
// New: cumulative physical seqlen pointers for group mode (pass nullptr for batch mode).
// seqstart_qs[i+1] - seqstart_qs[i] = physical seqlen_q of batch i (including padding); length = batch+1
// seqstart_ks[i+1] - seqstart_ks[i] = physical seqlen_k of batch i (including padding); length = batch+1
const int* seqstart_qs = nullptr;
const int* seqstart_ks = nullptr;
};
```
#### `fmha_bwd_launcher` actual structure
```cpp
struct fmha_bwd_launcher
{
std::function<float(fmha_bwd_args, const ck_tile::stream_config&)> run{};
// Total workspace size in bytes (host_ws_size + device_ws_size), computed by init().
// Zero for kUseQrQtrDorPipeline (writes dq directly, no acc buffer needed).
size_t workspace_size = 0;
fmha_bwd_launcher(const fmha_bwd_traits&);
// Copies auxiliary data (nsplits[], offsets[]) via hipMemcpy to the head of the GPU workspace,
// and zeros the dq_acc buffer portion (tail of workspace) if required.
// The memory pointed to by device_ws must be >= workspace_size bytes.
std::function<void(void* device_ws)> prepare_workspace{};
template <typename... Args>
float operator()(Args&&... args) const { return run(std::forward<Args>(args)...); }
private:
size_t host_ws_size = 0; // CPU workspace size (nsplits[] + offsets[] arrays)
size_t device_ws_size = 0; // GPU-only data size (dq_acc buffer)
std::unique_ptr<char[]> ws_host; // host-side workspace buffer
public:
template <typename T0, typename T1, typename T2, typename Arch>
void init(const fmha_bwd_traits& traits);
};
```
The `init<>()` template method (invoked by codegen dispatch branches as
`this->init<...>(t)`) is responsible for:
1. Setting the `run` lambda
2. Calling `FmhaBwdDQDKDVKernel::GetWorkspaceHostSize(batch)` to obtain
`host_ws_size`
3. Allocating `ws_host` (host memory)
4. Calling `FmhaBwdDQDKDVKernel::PrepareWorkspaceHost(ws_host.get(),
...)` to fill nsplits/offsets; return value is `device_ws_size`
5. `workspace_size = host_ws_size + device_ws_size`
6. Setting the `prepare_workspace` lambda (captures `this`, calls
`PrepareWorkspaceDevice`)
When no kernel matches the given traits, both `run` and
`prepare_workspace` are initialized to default lambdas that print a
warning to `std::cerr` and return gracefully (no exception).
#### Workspace overall layout
The workspace is managed by `FmhaBwdWorkspaceManager` and consists of
two segments:
```
Offset 0 (CPU-prepared segment, host_ws_size bytes; also hipMemcpy'd to the head of GPU workspace):
index_t nsplits[batch or 1] — per-batch nsplits array
group mode: batch elements
batch mode / non-deterministic: 1 element
[group mode only] long_index_t dq_acc_offsets[batch+1]
— per-batch element offset (inclusive prefix sum)
offsets[0]=0, offsets[i+1] = offsets[i] + nhead*nsplits_i*seqq_i*hdim_q
Offset host_ws_size (device data segment, device_ws_size bytes):
AccDataType dq_acc[total_elements] — compact dq_acc buffer (zeroed if required)
total_elements = sum_i(nhead * nsplits_i * seqq_i * hdim_q)
layout within each batch: [nhead, nsplits_i, seqq_i, hdim_q]
note: seqq_i uses the physical length (including padding)
```
Alignment constant (`ALIGNMENT = 16`):
```
nsplits_size = align_up(sizeof(index_t) * N, 16) // N = batch (group) or 1 (batch/non-det)
offsets_size = align_up(sizeof(long_index_t) * (batch+1), 16) // group mode only
host_ws_size = nsplits_size + offsets_size
dq_acc_offset = host_ws_size // GetDqAccDataOffset(batch)
```
**Key benefits**:
- The kernel reads nsplits/offsets directly from the workspace head — no
device-side recomputation.
- `FmhaBwdConvertQGradKernel` is completely decoupled from the pipeline
block shape (`kN0`): nsplits is read from `nsplits_ptr`, `kN0` is no
longer a template parameter, and multiple dq_dk_dv tiles with different
`F_bn0` values now share a single convert_dq kernel instance (under
receipt 1/2, deterministic convert_dq kernel count drops from ~300 to
60).
- nsplits/offsets are computed on the host and transferred in one
`hipMemcpy`; the dq_acc buffer follows immediately, at the offset given
by `GetDqAccDataOffset`.
#### Workspace size by scenario
| Scenario | `workspace_size` | Notes |
|----------|-----------------|-------|
| **kUseQrQtrDorPipeline** (any mode) | `0` | Writes dq directly; no acc
buffer; `PrepareWorkspaceHost` returns 0 |
| **Non-deterministic + batch mode** | `> 0` | nsplits[1]=1; dq_acc used
for atomic add; `workspace_size = host_ws_size +
batch*nhead*seqlen_q*hdim_q*ebytes` |
| **Non-deterministic + group mode** | `> 0` | nsplits[1]=1; dq_acc
contiguous layout; `workspace_size = host_ws_size +
nhead*seqstart_qs[batch]*hdim_q*ebytes` |
| **Deterministic + group mode** | `> 0` | nsplits[batch],
offsets[batch+1], compact dq_acc; nsplits_i computed independently per
batch |
| **Deterministic + batch mode persistent** | `> 0` | nsplits[1]
(uniform across batches); dq_acc `batch*nhead*nsplits*seqlen_q*hdim_q` |
**NeedsZeroDqAcc** (determines whether `PrepareWorkspaceDevice` calls
`hipMemset`):
- Persistent kernel (deterministic batch mode) or non-deterministic:
**must zero** (atomic add requires zero initialization)
- Deterministic group mode + no mask: **no zeroing needed** (every tile
writes its full region)
- Deterministic + with mask: **must zero** (some blocks are skipped,
leaving uninitialized tiles that would contribute to the reduction)
#### Caller usage
```cpp
// 1. Create launcher (traits include seqstart_qs/ks pointers; workspace_size is computed during construction)
fmha_bwd_launcher launcher(fmha_traits);
// 2. Read launcher.workspace_size directly
const auto ws_size = launcher.workspace_size;
// 3. Allocate a single GPU workspace
ck_tile::DeviceMem ws_buf(ws_size);
// 4. Copy nsplits/offsets to GPU head and zero dq_acc if required
launcher.prepare_workspace(ws_buf.GetDeviceBuffer());
// 5. Build args with a single workspace pointer; the kernel splits it internally
fmha_bwd_args args{
...,
ws_size > 0 ? ws_buf.GetDeviceBuffer() : nullptr, // workspace_ptr
};
launcher(args, stream_config);
```
---
### Key Code Structure
#### FmhaBwdWorkspaceManager (`fmha_bwd_kernel.hpp`, new class)
```cpp
template <typename AccDataType, bool kIsGroupMode, bool kIsDeterministic>
struct FmhaBwdWorkspaceManager
{
static constexpr size_t ALIGNMENT = 16;
// CPU workspace (nsplits + offsets) sizes
static size_t GetDqAccSplitsSize(int batch); // align_up(sizeof(index_t)*N, 16)
static size_t GetDqAccOffsetsSize(int batch); // group mode only: align_up(sizeof(long_index_t)*(batch+1), 16)
static size_t GetWorkspaceHostSize(int batch); // = SplitsSize + OffsetsSize
// Starting offset of dq_acc data within the full workspace (= host_ws_size)
static size_t GetDqAccDataOffset(int batch); // = GetWorkspaceHostSize(batch)
// Fills nsplits/offsets in the CPU workspace; returns device_ws_size (dq_acc buffer bytes)
template <bool kUseQrQtrDorPipeline, index_t kN0>
static size_t PrepareWorkspaceHost(void* cpu_ws, index_t batch_size, index_t hdim_q,
index_t nhead_q, index_t seqlen_q, index_t seqlen_k,
const index_t* seqstart_qs, const index_t* seqstart_ks);
// hipMemcpy's cpu_ws to device_ws head; hipMemset's the dq_acc portion to 0 if required
template <bool kUseQrQtrDorPipeline, bool kHasMask>
static void PrepareWorkspaceDevice(void* device_ws, const void* host_ws,
size_t device_ws_size, size_t host_ws_size);
};
```
#### workspace_ptr parsing (inside the kernel)
The kernel parses three address regions from `kargs.workspace_ptr`:
**Group mode (`FmhaBwdDQDKDVKernel::MakeKargs`)**:
```cpp
const uint8_t* ws = reinterpret_cast<uint8_t*>(workspace_ptr);
// dq_acc_ptr (stored in FmhaBwdCommonKargs)
ws + WorkspaceManager::GetDqAccDataOffset(batch)
// dq_acc_batch_offset_ptr (FmhaBwdGroupModeKargs field)
reinterpret_cast<const long_index_t*>(ws + WorkspaceManager::GetDqAccOffsetsOffset(batch))
```
**Batch mode**:
```cpp
ws + WorkspaceManager::GetDqAccDataOffset(batch) // dq_acc_ptr
// No offsets pointer; batch offset is computed inside run_() from nsplits
```
**`FmhaBwdConvertQGradKernel`** follows the same pattern:
- Group mode: extracts `dq_acc_ptr`, `dq_acc_batch_offset_ptr`, and
`nsplits_ptr` (`GetDqAccSplitsOffset(batch)`) from workspace
- Batch mode: reads nsplits from `nsplits_ptr[0]`; batch offset computed
internally
### Addressing in `run_()` (group mode)
```cpp
// Per-batch processing:
const long_index_t batch_offset_dq_acc = kargs.dq_acc_batch_offset_ptr[i_batch];
// seqq_i (physical length) derived from seqstart_q_ptr
const index_t seqq_i = kargs.seqstart_q_ptr[i_batch+1] - kargs.seqstart_q_ptr[i_batch];
// nsplits_i read from nsplits_ptr (convert_dq kernel) or from GetDqAccSplits
const long_index_t split_stride_i = static_cast<long_index_t>(seqq_i) * kargs.hdim_q;
const long_index_t nhead_stride_i = static_cast<long_index_t>(nsplits_i) * split_stride_i;
// Final address:
dq_acc_base + batch_offset_dq_acc + i_nhead * nhead_stride_i + i_split * split_stride_i
```
#### nsplits computation (`PrepareWorkspaceHost`)
`PrepareWorkspaceHost` is a template method of `FmhaBwdWorkspaceManager`
that still takes `kN0` as a template parameter (from
`BlockFmhaShape::kN0` of the dq_dk_dv pipeline). However, this parameter
is **only used inside this host-side function** to compute nsplits — it
is no longer passed into the convert_dq kernel.
| Mode | nsplits computation |
|------|---------------------|
| kUseQrQtrDorPipeline | Writes dq directly; nsplits[0]=0; returns
device_ws_size=0 |
| Non-deterministic | nsplits[0]=1; dq_acc used for atomic add |
| Deterministic + group mode | `ceil((seqstart_ks[i+1]-seqstart_ks[i]) /
kN0)` computed per batch |
| Deterministic + batch mode persistent | Same logic as the original
`GetDqAccSplits` (`dqdqkdv_workers` based) |
### Removing kN0 dependency from `FmhaBwdConvertQGradKernel`
`FmhaBwdConvertQGradKernel` previously required `kN0` as a template
parameter (via `BlockFmhaBwdConvertQGradPipelineProblem`) for two
purposes:
1. In batch mode `operator()`: self-computing `nsplits = ceil(seqlen_k /
kN0)`
2. The `b{kM0}x{kN0}` component of the kernel name string
Both have been removed in this refactor:
- **Batch mode**: now reads `kargs.nsplits_ptr[0]` directly (guarded by
`if constexpr(kIsDeterministic)` to avoid accessing a non-existent field
in non-deterministic instances)
- **Kernel name**: simplified to `b{kM0}`, no longer includes `kN0`
- **Template parameters**: `BlockFmhaBwdConvertQGradPipelineProblem`
drops the `kN0_` parameter; `fmha_bwd_convert_dq_traits_` drops the
`kN0` parameter; `F_bn0`/`convert_dq_bn0` fields removed from codegen
Effect: all dq_dk_dv tiles sharing the same `(hdim, dtype, mode, pad,
deterministic)` combination — regardless of `F_bn0` value
(16/64/128/192/256) — now share a **single** convert_dq kernel instance.
---
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
c1bf3f6972 |
[CK_TILE] Add SageAttention v2 forward kernel with multi-granularity quantization (#6574)
## Summary Add a CK_TILE forward kernel implementing [SageAttention v2](https://arxiv.org/abs/2411.10958) — an attention algorithm that applies multi-granularity quantization to Q/K/V before computing attention, trading minimal accuracy loss for higher throughput on low-precision hardware. ### Quantization design | Tensor | Supported data types | Scale granularity options | |--------|---------------------|--------------------------| | Q | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp (32 tokens), per-thread (4 tokens) | | K | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp (64 tokens), per-thread (16 tokens) | | V | fp8 | per-channel (always) | | O | bf16 | — | Three precision combinations are supported: `fp8/bf16` (QKV fp8, O bf16), `i8/fp8/bf16` (QK int8, V fp8, O bf16), and `i4/fp8/bf16` (QK int4, V fp8, O bf16). ### Architecture support - **gfx9** (CDNA2/3, e.g. gfx90a, gfx942) — full tile set - **gfx950** (CDNA4) — restricted tile set (N-per-block capped at 64 for fp8-family dtypes) ### Implementation - Two pipeline variants: `QRKSVS` (synchronous) and `QRKSVS_ASYNC` (async copy) - Masking support: no mask, causal (top-left / bottom-right), and generic windowed - Batch and group (variable-length) modes - Head dimension: d=128, d_v=128 - Python codegen under `example/ck_tile/49_sageattention/codegen/` generates kernel instances per target/dtype/tile combination - Smoke tests included via `tile_example_sageattn_fwd` ### Test commands \`\`\`bash # fp8 QKV ./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128 -kname=1 -prec=fp8bf16 -qscale=3 -init=3 # int8 QK, fp8 V ./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128 -kname=1 -prec=i8fp8bf16 -qscale=3 -init=3 \`\`\` \`-qscale\` values: 1=per-tensor, 2=per-block, 3=per-warp, 4=per-thread |
||
|
|
2a479f7411 |
[CK][CK_TILE] Fix FMHA codegen group mode dispatch (#6764)
## Motivation FMHA codegen had incorrect dispatch behavior in group mode. Two root causes: 1. Wrong field names in dispatch conditions — Used batch-mode fields (seqlen_q, seqlen_k) instead of group-mode fields (max_seqlen_q, max_seqlen_k), causing wrong kernel selection at runtime on gfx950. 2. Missing kernel variants — Group mode was overly filtered out from smaller-tile specializations (bwd) and lacked spatial-padding pipeline variants on gfx950 (fwd). gfx942 don't support trload pipeline. ## Technical Details fmha_bwd.py: - max_seq_q_cond and extra_cond now emit t.max_seqlen_q / t.max_seqlen_k for group mode. - Relaxed kernel filtering: group mode no longer skips tiles with max_seq_q != 0. fmha_fwd.py: - get_bm0_cond emits a.max_seqlen_q for group mode tile-size dispatch. - Added two qr_async_trload pipeline variants with spatial padding for gfx950 group mode. ## Test Plan Triggering AITER CI job: ## Submission Checklist - [ x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
a581a451f1 |
[CK Tile] Adding WMMA wrappers for dense builtins (#5801)
## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the first of the series of PRs that add all the necessary MMA builtins as a `amdgcn_mma` structs. ## Technical Details This change adds new specializations for WMMA dense builtins. In total, we have now 9 RDNA4 builtins and 3 RDNA3 builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Test pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Yung-sheng Tu <yung-sheng@streamhpc.com> |
||
|
|
34084aeb11 |
[CK_TILE] fix(fmha): support >2GB KV cache in batch prefill via template dispatch (#6653)
## Motivation
The CK batch prefill kernel previously failed (silent overflow + page
faults) when the KV cache exceeded 2 GB, blocking long-context inference
workloads (e.g., 128K+ token contexts with paged KV).
Two distinct failure modes were addressed:
1. **>4GB SRD overflow (`page_size < kN0`):** The SRD
`buffer_load_dwordx4` path uses a 32-bit `voffset` register; for small
page sizes the rebased SRD spans the full KV pool and the offset wraps
past 2 GB, corrupting K/V loads.
2. **gfx950 page-table fault (`page_size >= kN0`):** On CDNA4 the
hardware validates the **full SRD `num_records` range** against
page-table permissions (CDNA3 only checks per-instruction `voffset`).
After per-tile SRD rebase, an un-trimmed `num_records` field extends
past the live page and faults on freed/protected memory.
## Technical Details
**Two-mode `tile_scatter_gather` selected by the `kUseGlobalLoad`
template parameter:**
| Case | `page_size` | KV cache size | Mode | Load path | Addressing |
|---|---|---|---|---|---|
| 1 | `>= kN0` (large pages) | any | SRD (`kUseGlobalLoad=false`) |
`buffer_load_dwordx4` | 32-bit `voffset`, bounded by per-page rebase |
| 2 | `< kN0` (small pages) | `<= 2 GB` | SRD (`kUseGlobalLoad=false`) |
`buffer_load_dwordx4` | 32-bit `voffset`, fits in INT32 byte range |
| 3 | `< kN0` (small pages) | `> 2 GB` | Global-load
(`kUseGlobalLoad=true`) | `async_load_tile_raw_flat` (K) +
`load_tile_flat` (V) | 64-bit |
**Dispatch:** the auto-gen API layer (`fmha_batch_prefill.py`) selects
the kernel instantiation at launch from `(page_block_size,
num_total_pages * batch_stride_k * kElementBytes)`, so the small-page
penalty is paid only when correctness requires it.
**gfx950 SRD `num_records` trimming:** in the K and V rebase lambdas of
`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`,
`set_bottom_tensor_view_buffer_size(page_stride_k/v)` is called after
each rebase to constrain `num_records` to the live page. Required for
CDNA4 page-table validation; harmless on CDNA3.
**Pipeline sync for the global-load path:**
- V uses synchronous `load_tile_flat`; K uses
`async_load_tile_raw_flat`.
- `v_physical_pages_current` is double-buffered so the V flat load
doesn't race against the next iteration's K rebase computation.
**Arch guards:** `global_load_lds` intrinsics are gated to `__gfx94__` /
`__gfx950__` (CDNA3+). Other architectures hit a `dependent_false`
static_assert with a descriptive message.
**Device-side assertion convention:** SRD setters use
`__builtin_assume(cond)` (hint-only) rather than `<cassert>`'s
`assert()`. The latter introduces an `__assert_fail` call whose register
pressure scatters the K-SRD scalar register window across conditional
branches, corrupting `buffer_load_dwordx4` on gfx950.
## Test Plan
Tested on both MI308 (gfx942) and MI355 (gfx950) via the aiter wrapper
test suite. All coverage lives in **`op_tests/test_batch_prefill.py`**:
- **Functional matrix (96 cases)** — `test_batch_prefill`: `page_size ∈
{1, 16, 1024}` × `kv_layout ∈ {linear, vectorized}` × `dtype ∈ {bf16,
fp8 quant variants}` × `causal` × `soft_cap` × `LSE` × `batch_size ∈ {1,
4}` (parametrized to exercise per-sequence SRD rebase across batch
boundaries).
- **>2 GB coverage** — `test_batch_prefill_large_kvcache`: extended to
allocate a 5 GB+ KV cache pool and exercise both `kUseGlobalLoad=true`
(small-page) and `kUseGlobalLoad=false` (large-page rebase) paths.
Includes both single-batch and multi-batch (`batch_size=4`) cases to
exercise per-sequence SRD rebase across the >2 GB pool.
- Numerical reference: PyTorch SDPA, per-batch loop with `atol` / `rtol`
from the existing batch prefill test harness.
## Test Result
| Arch | `test_batch_prefill` | `test_batch_prefill_large_kvcache` (>2
GB) |
|------|----------------------|---------------------|
| MI308 (gfx942) | All passed | Passed |
| MI355 (gfx950) | All passed | Passed |
**Performance impact (gfx950, hot SRD path):**
- +2.67% kernel-time on `seqlen=1024 / page_sz=1024 / bf16 / sglang /
causal / soft_cap=30`, attributable in full to the two
`set_bottom_tensor_view_buffer_size` calls in the K/V rebase lambdas
(5-run median, signal/noise ≈ 9×).
- This cost is **mandatory for gfx950 correctness** on >2 GB workloads —
removing the setters re-introduces page-faults.
- gfx942: 0 regressions in the same range (all configs ≤ +0.97%).
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
cfb09d76a5 |
[CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550)
## Motivation New changes from upstream llvm-project cause an avalanche of warnings in CK. Gonna disable them by ignoring the lifetime-safety-intra-tu-suggestions flag until a better permanent solution is found. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
803874c73b |
CK][fmha] Add StreamLLM sink support to batch_prefill pipeline (#6479)
## Motivation
The existing paged-KV attention pipelines (pagedkv, splitkv) support
StreamLLM-style sink tokens — a fixed set of initial tokens kept in
attention alongside the sliding window. The `batch_prefill` pipeline
(chunked-prefill with VLLM-style block tables) previously hardcoded
`kHasSink = false`, making it incompatible with sink-based attention
patterns in LLM serving scenarios.
This PR extends `batch_prefill` to support `kHasSink` and wires it
into `fmha_fwd_runner` for validation against the existing CPU
reference.
## Technical Details
**Pipeline** (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`):
- When `kHasSink`, the K/V loop splits into a sink phase [0,
sink_seq_end)
and a window phase [seqlen_k_start, seqlen_k_end), mirroring pagedkv.
- K advance at the sink→window transition jumps
`seqlen_k_start - sink_seq_end + kN0` to bridge the gap.
- V scatter-gather offsets are re-initialized at the transition to fix a
window mismatch bug: V was lagging kN0 behind K after the large jump,
loading from the wrong sequence position.
- Bias window, dropout seq_offset, and mask type (LogitsSinkMask)
updated
for sink-awareness.
**Traits / codegen** (`tile_fmha_traits.hpp`, `fmha_fwd.hpp`,
`fmha_batch_prefill.py`):
- `TileFmhaBatchPrefillTraits` gains `kHasSink_` (was hardcoded
`false`).
- Codegen adds `F_sink` field; skips batch-mode kernels (group mode
required).
- CMake test filter broadened from 9 → 33 instances covering
fp16/bf16 × mask/nmask × lse/nlse × sink/nsink.
**Runner** (`fmha_fwd_runner.hpp`, `CMakeLists.txt`):
- `fmha_batch_prefill()` dispatched from `run_fwd` when:
group mode + paged KV + num_splits == 1.
- K/V strides corrected for runner's [num_pages, nhead_k,
page_block_size, hdim] layout.
- `page_block_size % 128` check relaxed: batch_prefill supports ps=16.
- CPU reference paged-KV reordering guards extended with
`CK_TILE_FMHA_FWD_BATCH_PREFILL_API`.
## Test Plan
Build with `-DFMHA_FWD_ENABLE_APIS="fwd;batch_prefill"`, run
`tile_example_fmha_fwd` in group mode with page_block_size=16.
Test matrix:
- Mask: no-mask, causal, sliding window
- Sink: nsink, sink=1..128
- dtype: fp16, bf16
- LSE output: on/off
- seqlen ∈ {512,1024,2048,4096} × window ∈ {32,256,512,1024}
- GQA, chunked prefill, large batch×seqlen
- page_block_size: 16, 32
## Test Result
171 test cases, all valid:y:
- nmask + nsink: ✓
- causal + nsink: ✓
- causal + sink=8: ✓
- sliding window + sink=8 (d=128, d=256): ✓
- bf16, LSE output, GQA: ✓
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
2574f37483 |
[CK_TILE] Enable canonical-NaN BF16 conversion for FMHA on RDNA (#6253)
## Motivation
- On gfx11/gfx12, the existing float -> bf16 conversion path in FMHA
forward adds noticeable overhead and causes a meaningful performance gap
versus fp16. The asm-based path (mode 3) does not improve this on RDNA
and can perform even worse.
- In particular, on gfx12, bf16 FMHA forward can be up to ~20% slower
than the corresponding fp16 path.
- This PR reduces that gap by switching FMHA forward to a different BF16
conversion strategy based on Triton’s canonical-NaN
round-to-nearest-even behavior.
## Technical Details
- Add a new `standard_cnan` BF16 conversion mode to CK Tile.
- Implement a canonical-NaN RTN `float -> bf16` conversion path based on
the Triton implementation.
- Enable this conversion mode by default for FMHA forward builds
targeting gfx11/gfx12.
- Retune gfx11/gfx12 FMHA forward kernel selection thresholds for some
`hdim=128` cases to keep kernel selection aligned with the updated
conversion behavior.
## Test Plan
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={hdim} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
## Test Result
- all tests passed when running `test_ck_tile_fmha`
- BF16 FMHA forward performance improves by up to ~5% on gfx11.
- BF16 FMHA forward performance improves by up to ~10% on gfx12.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
f5e00ec904 |
[CK_TILE] Skip padded k/n fragment work in qr_hpad FMHA fwd (#6450)
## Motivation
`qr_hpad` currently executes work for padded head-dim fragments even
when only a subset of the values are valid. This adds unnecessary
computation for head dimensions that require padding, such as `hdim=72`
and `hdim=80`, and hurts FMHA forward performance.
The goal of this PR is to make the padded-head-dim path skip invalid
work based on the actual valid fragment count, while preserving the
existing behavior for the non-padded path.
## Technical Details
This PR improves the `qr_hpad` FMHA forward path in three parts:
- Skip padded `k`/`n` fragments in the GEMM/pipeline path when only part
of the fragment is valid.
- Add partial GEMM0 tail handling for `qr_hpad` so the kernel uses the
valid fragment range instead of always computing over the padded extent.
- Retune the gfx11 `qr_hpad` kernel configuration after enabling the
partial-fragment path.
To keep the existing path stable, the implementation adds overloads for
the updated GEMM/pipeline interfaces. This allows existing full-tile
callers to keep using the previous form, while the `qr_hpad` path can
pass valid fragment counts when needed.
## Test Plan
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
## Test Result
- On gfx11 and gfx12, for head dimensions that require padding,
`tile_example_fmha_fwd` shows about 20-30% performance improvement at
`hdim=72/80`.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
0ddf22610c |
[CK] Remove obsolete benchmark_fwd_v3.sh script and README reference (#6305)
The tile_example_fmha_fwd_v3 target no longer exists in this project, making this benchmark script non-functional. |
||
|
|
027b95a21c |
[CK_TILE] Add CShuffleLds microbenchmark suite (#5383)
## Summary
Microbenchmarks isolating LDS store/load operations in CShuffleEpilogue
for bank conflict analysis.
## Motivation
CShuffleEpilogue performs LDS store (MFMA registers → LDS) and load (LDS
→ registers for coalesced global writes). This suite isolates each
operation to:
- Identify which operation causes bank conflicts
- Measure pure LDS bandwidth per access pattern
- Validate access patterns across MFMA tile sizes and wave layouts
## Components
- **Microkernels** (`tile_load_store_microkernels.hpp`):
`StoreTile<Setup>`, `LoadTile<Setup>`
- **Setup Adapters** (`benchmark_cshuffle_lds.hpp`): Wire
CShuffleEpilogue to microkernels
- **Template** (`benchmark_template.cpp.in`): Generated benchmarks with
timing
## Build
```bash
cmake -G Ninja -B build -S . \
-DGPU_TARGETS=gfx950 \
-DBUILD_CK_EXAMPLES=ON \
-DBUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS=ON
ninja -C build bench_lds_fp8_16x16x128_2x2_fp8
```
## New CMake Options
| Option | Default | Description |
|--------|---------|-------------|
| `BUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS` | OFF | LDS microbenchmarks |
| `BUILD_CK_TILE_FMHA_TESTS` | ON | FMHA tests |
| `BUILD_CK_TILE_ENGINE` | ON | Tile engine |
| `BUILD_CK_TILE_ENGINE_TESTS` | ON | Tile engine tests |
| `BUILD_CK_EXAMPLES` | ON | Examples |
| `BUILD_CK_TUTORIALS` | ON | Tutorials |
| `BUILD_CK_DEVICE_INSTANCES` | ON | Device instances |
| `BUILD_CK_PROFILER` | ON | Profiler |
Setting guards to OFF reduces cmake configure from ~150s to ~5s.
---------
Made-with: Claude Code, Opus 4.5
|
||
|
|
1ae4925302 |
[CK_TILE] Separate PermuteN epilogue from CShuffle epilogue into standalone file (#5863)
## Motivation
The PermuteN epilogue was previously embedded within
cshuffle_epilogue.hpp, despite having fundamentally different behaviour.
Coupling these two independent strategies in one file introduced
unnecessary complexity, SFINAE guards, and a dual operator() overload
selected at compile time via TiledMMAPermuteN_ template parameter.
This PR separates PermuteN into its own standalone
file(pertmuten_epilogue.hpp), simplifying both implementations and
making the codebase easier to maintain and extend independently.
## Technical Details
**New file: permuten_epilogue.hpp:**
contains PermuteNEpilogueProblem and PermuteNEpilogue, extracted from
the permuteN code path in cshuffle_epilogue.hpp.
**Cleanup of cshuffle_epilogue.hpp:**
- Removed the TiledMMAPermuteN_ template parameter from
[CShuffleEpilogueProblem]
- Removed the SFINAE-guarded permuteN operator() overload
- Removed the EnablePermuateN_ SFINAE alias
- CShuffle now only contains CShuffle logic; EightWave support
(independent feature) is retained
**Consumer migration :**
All consumer files now use compile-time epilogue selection via
[std::conditional_t]
`using GemmEpilogue = std::conditional_t<
TiledMMAPermuteN,
PermuteNEpilogue<PermuteNEpilogueProblem<...>>,
CShuffleEpilogue<CShuffleEpilogueProblem<...>>>;`
**Files modified:**
- flatmm_basic.cpp, moe_flatmm.cpp, a16w4_moe_flatmm.cpp,
mixed_prec_flatmm.cpp, mx_flatmm_instance.hpp — flatmm examples
- run_gemm_quant_example.inc — block-scale GEMM example
- gemm_weight_preshuffle_invoker.hpp — weight preshuffle invoker
- test_gemm_quant_fixtures.hpp, test_gemm_persistent_async_input.cpp,
test_gemm_pipeline_util.hpp — test utilities
- universal_gemm_invoker.hpp — universal GEMM invoker
- epilogue.hpp — add header updated to include permuten_epilogue.hpp
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
|
||
|
|
6cd016dde4 |
[CK Tile] Add Tile Distribution Encoding Calculator (#5515)
## Motivation We want to be able to calculate TileDistributionEncodings describing register mappings for any MmaOp. This is necessary for further integration with CK Tile. This MR adds a new struct TileDistrEncCalc, which takes an amdgcn_mma type (MmaOp) and provides ABC warp distribution encodings for mapping matrix fragment coordinates to register coordinates (lane, vector item) and vice versa. It is able to take CTranpose, Swizzle, and NumAccessA / NumAccessB template parameters for tweaking the tile distributions. Swizzle modification will be implemented later. The current implementation can deal with all intrinsic types and block-hiding. This MR also adds some additional static asserts and derived params within amdgcn_mma_base, to enforce consistency and help calculate Tile Distributions for block-hiding intrinsics. An Example was added that uses the Tile Distr Enc Calc to calc and print register layouts for Tile Distributions for some of our amdgcn_mma structs. It also makes sure that the CTranspose modifier works as intended. Some additional gfx9 intrinsics were added to test block-hiding layouts for the different types of C-block-hiding layouts. The sparse intrinsic wrappers were updated according to Chris's recent changes in another branch (https://github.com/ROCm/rocm-libraries/pull/5508), which moved the compression step outside of the intrinsic itself. This is necessary to make sure that the Calculator can deal with this new interpretation of the sparse intrinsics. I directly copied the new amdgcn structs from Chris's branch and changed nothing else to avoid more complex merges in the future. Note that this means I did not update a bunch of related sparse code since that would be a lot, and therefore I disabled test_amdgcn_sparse_mma for now. The amdgcn_mma_layout test was refactored a bit: - The old register mapping utility was removed and its use was replaced by the new TileDistrEncCalc - More tests were added to test layouts for different types of block-hiding and sparse intrinsics - The Selector method was removed and the tests were split up over target architectures, with each target arch having a direct list of amdgcn structs to be tested. This ensures that we force specific tests on specific architectures and makes sure that the selector doesn't quietly do some workarounds like creating compound intrinsics. ## Test Results Layout tests based on calculated tile distribution encodings pass on all architectures. Calculator works for all currently added amdgcn structs, which includes different types of block-hiding and sparse intrinsics. Printed layouts from new example verified by eye. CTranspose modifier tested for large set of intrinsics. |
||
|
|
4ccbcbe0a4 |
CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302)
Depends on #6300 ## Summary Remove 41 commented-out code blocks across 33 files in Composable Kernel, totaling ~200 lines. Identified using an automated dead code scanning skill (`ck-dead-code`) with a calibrated two-stage pipeline: 1. **Pre-filter**: Keyword-based scan found 1,338 `//`-commented blocks. Calibrated heuristics (trained on 50-sample expert classification) reduced to 89 high-confidence candidates — 93% noise reduction. 2. **Expert triage**: LLM expert classified each block in context as CODE_REMOVE, CODE_KEEP, or NOT_CODE. | Classification | Count | |---------------|-------| | Removed (this PR) | 41 | | Kept (debug helpers, alt configs, reference impls) | 32 | | Not code (false positives) | 16 | Removed blocks include: superseded implementations, old test data, abandoned stubs, unreachable code, and buggy dead code. |
||
|
|
65ad35becd |
[CK_TILE] Optimize FMHA head-dim padded path on gfx11/gfx12 (#6156)
## Motivation
On gfx11/gfx12, FMHA forward kernels that require head-dim padding show
a large performance drop compared to the exact-head-dim path. In
practice, padded cases such as `HDIM=72` and `HDIM=80` were falling too
far off the fast path.
This PR improves padded-head-dim FMHA performance on gfx11/gfx12 while
keeping the behavior for other GPUs unchanged.
## Technical Details
- Add/scope a dedicated padded-head-dim (`qr_hpad`) FMHA forward path
for gfx11/gfx12.
- For `receipt=0`, keep support conservative and only enable the padded
fast path for vector-safe cases (`head_dim % 8 == 0`), matching the
existing assumption used on other GPUs.
- Move `v_prefetch` later only for the head-dim-padded path on
gfx11/gfx12. This reduces live ranges and removes the register-spill
behavior seen in the earlier scheduling.
- Enable the buffer-load OOB check offset trick for the padded path on
gfx11/gfx12.
## Test Plan
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
## Test Result
Observed padded-head-dim performance improvements for HDIM=72/80:
- gfx11: about ~3.5x
- gfx1151: about ~2.0x
- gfx12: about ~1.3x
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
ceddfcc13c |
[CK_TILE] Refine FMHA Readme (#6003)
Updates the FMHA README to document fp8 precision support more
accurately, replacing the outdated "experimental" section and incomplete
CLI arg descriptions.
## Changes
- **`-prec` arg**: expanded supported values from `fp16/bf16/fp8/bf8` →
`fp32/fp16/bf16/fp8/fp8bf16/fp8fp32/mxfp8/mxfp4`
- **`-qscale` arg**: replaced single-line `1: per-tensor quantization`
with all four modes: `pt/1`, `bs/2`, `kvbs/3`, `mx/4`
- **FP8 support section**: replaced "FP8 experimental support" paragraph
with:
- Supported targets: gfx942/gfx950 + ROCm 6.0+
- Table distinguishing `fp8` / `fp8bf16` / `fp8fp32` by Q/K/V input type
and output type
- Table for all `-qscale` modes with descriptions
- Note that `-vlayout=r` (`seqlen*hdim` for V) is the only supported
layout for fp8 types
<!-- START COPILOT ORIGINAL PROMPT -->
<details>
<summary>Original prompt</summary>
Please open a PR against base branch `develop` in repository
`ROCm/rocm-libraries` applying the following documentation updates
within the composable kernel path.
## Scope
Update the file:
- `projects/composablekernel/example/ck_tile/01_fmha/README.md`
## Changes to apply
Apply the combined edits described in the diffs below (two consecutive
patches). Ensure the final file content includes **both** sets of
changes.
### Patch 1
- In the CLI args section:
- Update `-qscale` description lines to include:
- `pt or 1, per-tensor scale`
- `bs or 2, block scale`
- `kvbs or 3, Q per-tensor, K/V per-page block scale`
- `mx or 4, microscaling (exclusively for mxfp8/mxfp4)`
- Update `-prec` supported data types from `fp16/bf16/fp8/bf8` to
`fp32/fp16/bf16/fp8/fp8bf16/fp8fp32/mxfp8/mxfp4`.
- Replace the existing "FP8 experimental support" section with an "FP8
support" section stating:
- FP8 FMHA kernels supported on gfx942/gfx950 with ROCm 6.0+
- Precision selectable via `-prec=fp8` (or `fp8bf16`, `fp8fp32`) for
`tile_example_fmha_fwd`
- Add a table describing `-qscale` modes:
- `n` or `0`: No quantization scale (default)
- `pt` or `1`: Per-tensor quantization scale
- `bs` or `2`: Per-block quantization scale
- `kvbs` or `3`: Q per-tensor + K/V per-page block scale
- `mx` or `4`: Microscaling (MX format), exclusively for `mxfp8` and
`mxfp4`
- Add/keep note that currently only `-vlayout=r` (`seqlen*hdim` for V
matrix) is supported for fp8 data types.
### Patch 2
Further refine the "FP8 support" paragraph to explain the difference
between `fp8`, `fp8bf16`, and `fp8fp32` via a table:
| `-prec` value | Q/K/V input type | Output type | Description |
|---|---|---|---|
| `fp8` | fp8 | fp8 | Fully fp8: both inputs and output are in fp8 |
| `fp8bf16` | fp8 | bf16 | Mixed precision: fp8 inputs, bf16 output —
useful when the consumer expects a wider-range output format |
| `fp8fp32` | fp8 | fp32 | Mixed precision: fp8 inputs, fp32 output —
highest-precision output, suitable for debugging or further fp32
processing |
Keep the rest of the `-qscale` table and the `-vlayout=r` limitation
note.
## Notes
- PR title must be: `[CK_TILE] Add fp8 in FMHA readme`
- Ensure markdown formatting renders correctly (tables, code
formatting).
- Only modify the file listed above.
The following is the prior conversation context from the user's chat
exploration (may be truncated):
User: 能幫我上這個pr嗎 在composable kernel裡的路徑
diff --git a/example/ck_tile/01_fmha/README.md
b/example/ck_tile/01_fmha/README.md
index 0b526f4e9fc..1627435863b 100644
--- a/example/ck_tile/01_fmha/README.md
+++ b/example/ck_tile/01_fmha/README.md
@@ -62,14 +62,17 @@ args:
-d_v head dim for v, -1 means equal to d (default:-1)
-scale_s scale factor of S. 0 means equal to 1/sqrt(hdim). (default:0)
-qscale n or 0, no scaling (default:n)
- 1: per-tensor quantization.
+ pt or 1, per-tensor scale
+ bs or 2, block scale
+ kvbs or 3, Q per-tensor, K/V per-page block scale
+ mx or 4, microscaling (exclusively for mxfp8/mxfp4)
-iperm permute input (default:1)
if true, will be b*h*s*d, else b*s*h*d
-operm permute output (default:1)
-bias n or 0, no bias (default:n)
e(lementwise) or 1, elementwise bias with 1*1*s*s. e:1, 1*h*s*s. e:2,
b*h*s*s
a(libi) or 2, alibi with 1*h. a:1, b*h
- -prec data type. fp16/bf16/fp8/bf8 (default:fp16)
+ -prec data type. fp32/fp16/bf16/fp8/fp8bf16/fp8fp32/mxfp8/mxfp4
(default:fp16)
-mask 0: no mask, 1: top-left(same as 't'), 2:bottom-right(same as 'b')
(default:0)
't', top-left causal mask, 'b', bottom-r causal mask
't:l,r', top-left sliding window attn(swa) with FA style left right size
@@ -161,7 +164,17 @@ We support sequence padding and variable-length
processing in both batch and gro
Both approaches optimize memory access patterns while supporting
flexible sequence length requirements commonly found in transformer
inference scenarios.
-## FP8 experimental support
-As described in [this
blog](https://blog.hippoml.com/8bit-hippoattention-up-to-3x-faster-compared-to-flashattentionv2-8f9def90b482),
we have an experimental support for fp8 fmha kernels, you can evaluate
the performance by setting the arg `-prec=fp8` to the
`tile_example_fmha_fwd`, on a gfx942 machine and ROCm 6.0+.
+## FP8 support
+FP8 FMHA kernels are supported on gfx942/gfx950 machines with ROCm
6.0+. You can select fp8 precision by setting the arg `-prec=fp8` (or
`fp8bf16`, `fp8fp32`) to the `tile_example_fmha_fwd`.
-Currently we only support `-vlayout=r`( `seqlen*hdim` for V matrix) for
fp8 and fp8bf16 now. Full feature support will come later.
+The following quantization scale modes are available via `-qscale`:
+
+| `-qscale` value | Description |
+|---|---|
+| `n` or `0` | No quantization sca...
</details>
<!-- START COPILOT CODING AGENT SUFFIX -->
*This pull request was created from Copilot chat.*
>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: asleepzzz <4926646+asleepzzz@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
|
||
|
|
6dc44114ba |
[CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051)
## Motivation The existing FMHA V3 pipeline only supports fp16/bf16 data types. This PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950, enabling higher throughput for FP8 inference workloads using the assembly-optimized V3 code path. ## Technical Details **Warp GEMM:** - Add FP8 32x32x32 warp gemm with C-transposed distribution (`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries **V3 Kernel (`fmha_fwd_v3_kernel.hpp`):** - Add per-tensor descale support for Q, K, V tensors, passing descale pointers through to pipeline kargs **V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`):** - Add FP8 data path with dtype-aware type selection - Add asm volatile P matrix conversion from f32 to fp8 - Add FP8-aware instruction scheduling in `CoreLoopScheduler` **V3 Pipeline Policy (`block_fmha_fwd_v3_pipeline_default_policy.hpp`):** - Add FP8 QK warp gemm selection (SwizzleB variant for V tile distribution compatibility) **Codegen (`fmha_fwd.py`):** - Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128) - Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale: no/pertensor) - Extend `can_dispatch_v3` condition for fp8bf16 + pertensor **Misc:** - Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum (`arch.hpp`) - Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`) V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`) pending further validation. ## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950) | Problem | Regular (TFlops) | Varlen (TFlops) | |---|---:|---:| | batch=1 heads=6/1 seqlen=1024 causal | 48.9 | 47.6 | | batch=1 heads=6/1 seqlen=2048 causal | 119.8 | 117.4 | | batch=1 heads=6/1 seqlen=4096 causal | 263.7 | 259.2 | | batch=1 heads=6/1 seqlen=8192 causal | 548.9 | 543.6 | | batch=1 heads=6/1 seqlen=16384 causal | 1043.0 | 1063.7 | | batch=1 heads=6/1 seqlen=32768 causal | 1237.2 | 1279.6 | | batch=1 heads=6/1 seqlen=65536 causal | 1315.4 | 1382.7 | | batch=1 heads=6/1 seqlen=131072 causal | 1326.3 | 1402.2 | | batch=1 heads=16/1 seqlen=65536 causal | 1298.7 | 1388.4 | | batch=1 heads=40/40 seqlen=37200 non-causal | 1248.9 | 1326.1 | ## Test Plan Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40), GQA ratios (1:1, 1:8), and causal/non-causal modes. Verified all cases dispatch to the V3 pipeline by enabling `F_is_v3_enabled` and confirming kernel names contain `qr_async_trload_v3`. ## Test Result 176/176 tests passed with V3 enabled. All cases correctly dispatched to V3 pipeline with `pertensor` quantization. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
5370485459 |
[CK_TILE] Restrict FMHA codegen to the kernel subset used by FlashAttention (#6038)
## Motivation Currently, the CK FlashAttention integration generates a broader FMHA kernel set than the FlashAttention wrappers can actually dispatch, which increases compile time without improving runtime coverage. ## Technical Details The FlashAttention CK wrappers do not use all logits/LSE variants emitted by the default FMHA codegen. The direct `fmha_fwd` path always uses softcap-disabled, LSE-enabled kernels, and the `fmha_fwd_splitkv` path only uses softcap-disabled kernels. This change trims codegen to that subset and stops generating the unused logits/LSE variants. This reduces the generated forward kernel set without changing `fmha_fwd_appendkv` or `fmha_bwd`. The reduced kernel set was validated by building and running the [FlashAttention](https://github.com/Dao-AILab/flash-attention) CK backend. Across targets, the total generated FMHA kernel count is reduced by: - `gfx942`: 29.3% - `gfx1100`: 33.7% - `gfx1201`: 31.3% ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> pytest test/test_flash_attn_ck.py from https://github.com/Dao-AILab/flash-attention ## Test Result all tests passed <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
ba0efe01af |
[CK Tile] Add sink token gradient support in FMHA backward pass (#5504)
## Motivation Adds sink token support to the FMHA backward kernel (dot_do_o pipeline): ## Technical Details - Extend BlockFmhaBwdOGradDotOPipelineProblem with LSEDataType - Add sink_ptr/d_sink_ptr/lse_ptr/nhead to FmhaBwdOGradDotOCommonKargs - Compute per-head sink gradient via atomic accumulation in the pipeline - Update example runner with reference validation for sink gradient ## Test Plan Add new test case ## Test Result WIP ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
47fb489d78 | [CK_TILE] Update gfx12 FMHA forward kernel configs (#5798) | ||
|
|
5844015670 |
[CK_TILE] Fix Windows build in FMHA head grouping (#5977)
## Motivation This is a follow-up fix for [PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018). [PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018) added LLC-aware FMHA head grouping / head-major scheduling on RDNA, but it also introduced Linux-only code paths, including `<dirent.h>`, which break Windows builds. This change fixes that by guarding the Linux-specific LLC probing logic so non-Linux platforms can still build correctly. ## Technical Details - Guard `<dirent.h>` with `#ifdef __linux__` - Guard KFD sysfs traversal logic with `#if defined(__linux__)` - On non-Linux platforms, return `0` from `get_kfd_sysfs_llc_cache_bytes()` - Preserve existing fallback behavior through: - `CK_TILE_FMHA_LLC_CACHE_MB` - arch-based default LLC sizes - no head grouping when no LLC size can be resolved ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
fa912ed457 |
[CK][CK_TILE] Add fp8bf16 hdim=256 tile for batch prefill (#5918)
## Motivation
FP8 batch prefill kernels currently only support head_dim=128. Models
with head_dim=256 hit the "invalid argument for batch_prefill" error
because no matching kernel variant exists in the codegen dispatch.
## Technical Details
Add a hdim=256 tile size entry for fp8bf16 in the batch prefill codegen
recipe (`fmha_batch_prefill.py`).
Tile configuration: `FmhaFwdTileSize(128, 128, 32, 256, 32, 256, 4,1,1,
4,1,1, 32,32,32, 32,32,32, -1)`
- bm0=128, bn0=128 (Q/K tile sizes)
- bn1=256, bk0max=256 (V head_dim=256)
- Warp MFMA 32x32x32 (fp8 MFMA instructions)
This mirrors the existing bf16/fp16 hdim=256 tile but uses fp8 warp
sizes.
## Test Plan
Tested on both MI308X (gfx942) and MI355X (gfx950) via aiter batch
prefill test with the following matrix:
- page_size: {1, 16, 1024}
- kv_layout: {linear, vectorized}
- lookup_table: {sglang, vllm}
- causal: {true, false}
- logits_soft_cap: {0.0, 30.0}
- contiguous_kv: {true, false}
## Test Result
**MI308X (gfx942):** 160 passed, 32 skipped (page_size=1 + vectorized
not applicable)
**MI355X (gfx950):** 120 passed, 72 skipped (pre-existing ROCm 7.2
compiler issue with causal + no softcap)
No register spills on either platform.
### Profiling — MI355X (gfx950), FP8 pertensor, hdim=256, seqlen=1024, 8
heads
| page_sz | kv_layout | table | causal | soft_cap | time_us | TFLOPS |
|---------|-----------|-------|--------|----------|---------|--------|
| 1 | linear | sglang | False | 0.00 | 55.01 | 156.16 |
| 1 | linear | vllm | False | 0.00 | 55.12 | 155.84 |
| 1 | linear | sglang | False | 30.00 | 62.63 | 137.16 |
| 1 | linear | vllm | False | 30.00 | 62.16 | 138.20 |
| 1 | linear | sglang | True | 30.00 | 64.09 | 67.01 |
| 1 | linear | vllm | True | 30.00 | 63.85 | 67.27 |
| 16 | linear | sglang | False | 0.00 | 57.00 | 150.69 |
| 16 | vectorized | sglang | False | 0.00 | 57.55 | 149.25 |
| 16 | linear | vllm | False | 0.00 | 56.80 | 151.23 |
| 16 | vectorized | vllm | False | 0.00 | 57.32 | 149.87 |
| 16 | linear | sglang | False | 30.00 | 64.77 | 132.62 |
| 16 | vectorized | vllm | False | 30.00 | 63.54 | 135.18 |
| 16 | linear | sglang | True | 30.00 | 66.84 | 64.26 |
| 16 | vectorized | vllm | True | 30.00 | 66.12 | 64.96 |
| 1024 | linear | sglang | False | 0.00 | 58.25 | 147.46 |
| 1024 | vectorized | sglang | False | 0.00 | 57.53 | 149.31 |
| 1024 | linear | vllm | False | 0.00 | 58.06 | 147.94 |
| 1024 | vectorized | vllm | False | 0.00 | 57.55 | 149.27 |
| 1024 | linear | sglang | False | 30.00 | 65.38 | 131.38 |
| 1024 | vectorized | vllm | False | 30.00 | 63.64 | 134.98 |
| 1024 | linear | sglang | True | 30.00 | 66.85 | 64.25 |
| 1024 | vectorized | vllm | True | 30.00 | 65.26 | 65.81 |
### Profiling — MI308X (gfx942), FP8 pertensor, hdim=256, seqlen=1024, 8
heads
| page_sz | kv_layout | table | causal | soft_cap | time_us | TFLOPS |
|---------|-----------|-------|--------|----------|---------|--------|
| 1 | linear | sglang | False | 0.00 | 110.18 | 77.96 |
| 1 | linear | vllm | True | 30.00 | 134.33 | 31.97 |
| 1 | linear | sglang | True | 30.00 | 134.59 | 31.91 |
| 16 | linear | sglang | False | 0.00 | 115.43 | 74.42 |
| 16 | vectorized | sglang | False | 0.00 | 106.11 | 80.95 |
| 16 | linear | vllm | False | 0.00 | 116.34 | 73.83 |
| 16 | vectorized | vllm | False | 0.00 | 106.17 | 80.91 |
| 16 | linear | sglang | False | 30.00 | 135.61 | 63.34 |
| 16 | vectorized | vllm | False | 30.00 | 122.37 | 70.20 |
| 16 | linear | sglang | True | 0.00 | 117.44 | 36.57 |
| 16 | vectorized | vllm | True | 0.00 | 108.81 | 39.47 |
| 16 | linear | sglang | True | 30.00 | 139.43 | 30.80 |
| 16 | vectorized | vllm | True | 30.00 | 125.87 | 34.12 |
| 1024 | linear | sglang | False | 0.00 | 110.65 | 77.63 |
| 1024 | vectorized | sglang | False | 0.00 | 101.70 | 84.46 |
| 1024 | linear | vllm | False | 0.00 | 111.71 | 76.89 |
| 1024 | vectorized | vllm | False | 0.00 | 101.55 | 84.59 |
| 1024 | linear | sglang | False | 30.00 | 129.33 | 66.42 |
| 1024 | vectorized | vllm | False | 30.00 | 120.95 | 71.02 |
| 1024 | linear | sglang | True | 0.00 | 112.26 | 38.26 |
| 1024 | vectorized | vllm | True | 0.00 | 103.02 | 41.69 |
| 1024 | linear | sglang | True | 30.00 | 133.73 | 32.12 |
| 1024 | vectorized | vllm | True | 30.00 | 124.75 | 34.43 |
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
|
||
|
|
c60514f371 |
[CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393)
## Motivation
Add StreamK work distribution to the CK Tile grouped convolution
backward weight kernel. Split-K divides the K-dimension uniformly across
a fixed `k_batch`, which causes load imbalance when the number of output
tiles doesn't evenly fill the GPU. StreamK distributes total
K-iterations evenly across workgroups, improving utilization on these
shapes.
## Technical Details
StreamK is added as an `if constexpr` branch in the existing kernel,
selected by the `TilePartitioner_` template parameter. Two reduction
strategies are supported:
- **Linear**: tile-starter sequentially accumulates partials from
contributing CTAs
- **Tree**: pairwise binary tree reduction (O(log n) depth, faster for
many contributors)
Both persistent and non-persistent data-parallel (DP) sections are
supported.
Key changes:
- `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution
path with `RunStreamK`/`RunStreamKLoop`, partial store/load via
workspace, flag-based cross-CTA synchronization,
`GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions
- `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers)
and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by
both GEMM and Conv StreamK kernels
- `streamk_gemm_kernel.hpp`: Refactored to use shared helpers
- Merged split-K and StreamK example invokers via `PartitionerPolicy`
template parameter
- StreamK example binary with `--streamk_reduction=linear|tree` and
`--streamk_persistent=0|1`
- CK Builder integration: `SpecifiesStreamK` concept,
`TilePartitionerType` factory helper, `InstanceTraits` with StreamK
fields
- 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP),
negative, builder regression
### Performance (MI355X, gfx950)
Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}):
| Shape | 16x64 tiles | | 128x128 tiles | |
|---|---|---|---|---|
| | Split-K | StreamK | Split-K | StreamK |
| 1x1 128x128 N=32 28x28 | 1.00x | 0.54x | 1.00x | 0.81x |
| 3x3 128x128 N=32 14x14 | 1.00x | 0.59x | 1.00x | 0.62x |
| 1x1 256x64 N=32 56x56 | 1.00x | 0.83x | 1.00x | 1.83x |
| 3x3 512x512 N=2 7x7 | 1.00x | 1.12x | 1.00x | 0.62x |
| 1x1 1024x1024 N=4 7x7 | 1.00x | 1.09x | 1.00x | 0.60x |
| 3x3 128x128 N=32 28x28 | 1.00x | 0.44x | 1.00x | 0.96x |
| 3x3 256x256 N=32 14x14 | 1.00x | 0.67x | 1.00x | 0.93x |
| 3x3 512x512 N=32 7x7 | 1.00x | 0.98x | 1.00x | 1.16x |
StreamK's value depends on tile config: with larger tiles (fewer output
tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up
to 1.16x on typical large-channel convolutions. Tree reduction
consistently outperforms Linear when multiple CTAs contribute to the
same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n)
sequential accumulation. The table reports the best of Linear and Tree
for each shape.
## Test Plan
```bash
ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk
./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk
# Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON)
ninja -C build check-builder
```
30 tests covering:
- Host-side: type traits, kernel args construction, grid size, workspace
size
- GPU end-to-end (Linear + Tree): small/medium shapes, multi-group,
stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher
occupancy
- Persistent DP: Linear + Tree with persistent data-parallel dispatch
- Negative: `IsSupportedArgument` rejects unaligned K and C
- Builder: Create (instance string validation) + Execution (reference
comparison) + instance string regression
## Test Result
All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK
tests pass. Full `check-builder` suite passes. Tolerances computed
dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware).
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
b973cc43d3 |
[CK]fix: remove redundant structured sparsity check in run_gemm_example.inc (#5612)
## Motivation This issue if found via https://github.com/ROCm/rocm-libraries/pull/4302#discussion_r2958603418 and is introduced via https://github.com/ROCm/rocm-libraries/pull/5323. ## Technical Details The outer `if` and inner `if constexpr` both checked GemmConfig::UseStructuredSparsity. Merged into a single `if constexpr` since both preshuffle and UseStructuredSparsity are compile-time constants. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
39bc8453c6 |
[CK_TILE] add tf32 support (#4302)
## Proposed changes TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in CK_TILE on gfx942 and gfx950. ## Checklist Please put an into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run on all changed files - [ ] Any dependent changes have been merged ## Discussion --- 🔁 Imported from [ROCm/composable_kernel#3538](https://github.com/ROCm/composable_kernel/pull/3538) 🧑💻 Originally authored by @yingluAMD --------- Co-authored-by: yingluAMD <Yingmao.Lu@amd.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
75357722b8 |
CK Tile MX GEMM Packing Improvement (#5323)
## Motivation Reduce the scale loading size and also has better utilization of MFMA scale selection. ## Technical Details Add up the packing of mx scales. ## Test Plan Use the existing test cases. ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Sami Remes <samremes@amd.com> Co-authored-by: Enrico Degregori <enrico@streamhpc.com> |
||
|
|
b5894b3cbe |
[CK_TILE] Add LLC-aware FMHA head grouping and head-major scheduling on RDNA (#5018)
## Motivation Long-sequence FMHA can become memory-bound when K/V working sets exceed Infinity Cache (LLC), causing repeated DRAM traffic across heads. This PR introduces LLC-aware launch ordering improvements for FMHA forward, and it is currently enabled only on gfx11 and gfx12. The approach is inspired by [`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217), adapted to CK’s kernel/runner structure and layout handling. In this context, `bshd` is the layout used in Flash-Attention, while `bhsd` is the default layout used by the CK Tile FMHA example. ## Technical Details This PR adds two complementary strategies: - For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware head grouping: - Estimate LLC size (env override, KFD sysfs, or arch default). - Compute group size from K/V bytes per head vs LLC target. - Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and related tensors). - For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit launch-order adjustment: - Keep a single kernel launch. - Reinterpret block linearization in `GetTileIndex` to make execution head-major, improving temporal locality of per-head K/V reuse. Additional integration updates: - Propagate `num_head_q_total` and `head_start` through FMHA args/kargs. - Use global head indexing for dropout RNG stream mapping so grouped launches keep deterministic/consistent dropout behavior. - Keep fallback behavior unchanged when grouping is not beneficial or disabled. ## Test Plan - `test_ck_tile_fmha` - `tile_example_fmha_fwd` ## Test Result - `test_ck_tile_fmha`: all tests passed. - `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201, and all of them show higher performance compared to the baseline. The improvement is consistent, and performance is well maintained even at long sequence lengths. ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} - TFLOPs by sequence length target: gfx1100 layout: bhsd SeqLen | Before | After | Speedup -- | -- | -- | -- 1024 | 56.27 | 61.48 | 1.09x 4096 | 67.10 | 72.27 | 1.08x 8192 | 65.99 | 71.64 | 1.09x 12288 | 61.60 | 76.61 | 1.24x 16384 | 58.99 | 75.74 | 1.28x 20480 | 57.32 | 74.42 | 1.30x 24576 | 56.89 | 74.25 | 1.31x 27280 | 18.93 | 24.48 | 1.29x - TFLOPs by sequence length target: gfx1201 layout: bshd SeqLen | Before | After | Speedup -- | -- | -- | -- 1024 | 66.79 | 65.90 | 0.99x 4096 | 85.90 | 86.80 | 1.01x 8192 | 77.06 | 90.29 | 1.17x 12288 | 58.36 | 88.98 | 1.52x 16384 | 52.12 | 88.88 | 1.71x 20480 | 48.11 | 88.42 | 1.84x 24576 | 47.12 | 89.07 | 1.89x 27280 | 49.05 | 50.31 | 1.03x ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
a9a8a57786 |
[CK Tile] Eight Waves pipeline GEMM (#4964)
## Motivation Eight waves pipeline was added for ABQuant. The goal of this PR is to enable it also for GEMM ## Technical Details Summary: - Block: - Create block struct for GEMM using eight warps specific distribution encodings - Use this block struct in ABQuant for encodings - Pipeline: - Create impl pipeline for eight waves which can be used by GEMM and ABQuant as base (and for AQuant and BQuant in the future) - Create eight waves pipeline for GEMM (this can not be easily integrated in the existing async pipeline) - Pipeline policy: - Extract GEMM specific parts in the ABQuant policy to define GEMM policy (then ABQuant use it as base and add Quant specific methods) - Minor: naming was inconsistent between warp/wave, everything is now referred to as eight waves So overall we have: - block struct directly used by GEMM -> ABQuant derived struct to implement operator - Impl base pipeline with general implementation -> GEMM and ABQuant pipelines use it to avoid code duplication but still define their own pipelines - pipeline policy struct directly used by GEMM -> ABQuant derived policy struct for Quant specific parts ## Test Plan Added new tests for GEMM pipeline: `test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950 supports it). Note: K padding test is disabled for this pipeline because it's not implemented yet ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
6373f5df7b |
[CK][CK Tile] Grouped Convolution Backward Weight set of fixes (#5387)
## Motivation Grouped Convolution Backward Weight split k fixes for CK tile kernels ## Technical Details - get k batch from kargs to get deduced k batch - multiply zeroing size by data type size - disable v6 (producing a incorrect results) ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result Pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Ville Pietilä <> |
||
|
|
f6bfcad437 |
[CK_TILE] FMHA BWD Use Persistent Kernels in Deterministic Mode (#5174)
## Motivation This PR enables a persistent-kernel execution path for FMHA backward (dQ/dK/dV) in deterministic mode, adjusting how dQ accumulation is split, stored, and converted back to final gradients. ## Technical Details - Introduces a persistent-kernel grid mapping in deterministic mode and updates split-count calculation accordingly. - Extends kernel kargs to carry batch-related info needed for persistent scheduling and dQ conversion. - Refactors dQ store conditions and adds mask-type traits/utilities and runner logging updates. ## Test Plan - Jenkins [base](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/10/pipeline) - Jenkins [AITER](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/12/pipeline) - Jenkins [FMHA](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/11/pipeline) - local FA tests ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
a124535133 |
ck_tile: add gtest unit tests for MX flatmm (gfx950) (#5082)
## Summary
- Add correctness unit tests for the MX-format flatmm kernel
(`example/ck_tile/18_flatmm/mxgemm`) under `test/ck_tile/flatmm/`
- Tests cover all five dtype combinations: FP4×FP4, FP8×FP8, FP6×FP6,
FP8×FP4, FP4×FP8
- Tests cover all four kernel dispatch paths (the `has_hot_loop` ×
`tail_num` product):
- `has_hot_loop=false, tail=ODD` (K=256, num_loop=1)
- `has_hot_loop=false, tail=EVEN` (K=512, num_loop=2)
- `has_hot_loop=true, tail=ODD` (K=768, num_loop=3)
- `has_hot_loop=true, tail=EVEN` (K=1024, num_loop=4)
- Remove unsupported `-split_k` CLI option from
`tile_example_mx_flatmm`; the pre-shuffled B layout is incompatible with
K-splitting and the option silently produced wrong results
## Changes
**New files (`test/ck_tile/flatmm/`):**
- `CMakeLists.txt` — builds 40 kernel instances as a shared OBJECT
library, links into 5 per-dtype test executables; forwards
`-DCK_TILE_USE_OCP_FP8` when `CK_USE_OCP_FP8` is ON
- `test_mx_flatmm_base.hpp` — base test fixture with
`run_test_with_validation(M, N, K, kbatch=1)`
- `test_mx_flatmm_fixtures.hpp` — concrete `TestMXFlatmm` typed test
class and type aliases
- `test_mx_flatmm_fp{4fp4,8fp8,6fp6,8fp4,4fp8}.cpp` — per-dtype
`TYPED_TEST_SUITE` files
**Modified files:**
- `example/ck_tile/18_flatmm/mxgemm/mx_flatmm_arch_traits.hpp` — moved
`preShuffleWeight` here (was in `mx_flatmm.cpp`) so it is includeable by
both the example and the tests
- `example/ck_tile/18_flatmm/mxgemm/mx_flatmm.cpp` / `run_mx_flatmm.inc`
— removed `-split_k` CLI arg, hardcoded `k_batch=1`, fixed `k_split`
formula, updated call sites after `preShuffleWeight` move
- `test/ck_tile/CMakeLists.txt` — added `add_subdirectory(flatmm)`
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
|
||
|
|
25d9fdfc16 |
[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368)
## Motivation Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline ## Technical Details The microscaling is used when quant scale mode is `BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are fp8/bf8/fp4. Supported features: * only "qr" pipeline is implemented * hdim 128 and 256 (smaller hdim are not possible due to restrictions of "qr" pipeline, but they can be computed using instances with padding) * both 32x32x64 and 16x16x128 scale MFMAs are supported * Q and K scales are applied in hdim, V scales - in seqlen dimension * column-major V only * batch and group mode * bias, Alibi (tested but no instances by default, just like fp8) * masking etc. Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008 ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8 ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
8b46a8d997 |
[CK_TILE] MX GEMM non-preshuffled RCR layout (#4594)
## Motivation Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled layouts using async pipeline. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> |
||
|
|
559ad6f0b1 |
[CK_TILE] Update gfx11 FMHA forward kernel configs (#5088)
## Motivation Tune gfx11 FMHA codegen to recover performance for mainly PSSK (padded seqlen_q/k) cases. This tuning is based on heuristic search and improves performance in most tested shapes. Performance should be evaluated on top of [`ROCm/rocm-libraries#5018`](https://github.com/ROCm/rocm-libraries/pull/5018) (required baseline). ## Technical Details - Updated gfx11 codegen heuristic choices for tile size and occupancy. - Updated gfx11 pipeline selection: - Disabled the `npad` (`f,f,f,f`) qr entry because it was consistently slower than the `pssk` (`t,t,f,f`) path, and kept `pssk` enabled so npad cases are dispatched to the faster kernel path.` - Kept gfx12 unchanged: with PSSK support from [`ROCm/rocm-libraries#4957`](https://github.com/ROCm/rocm-libraries/pull/4957), existing gfx12 config is already sufficient. - Tuning rationale: - In some cases, higher `kBlockPerCu` lowers register pressure. - On RDNA, this generally aligns with better performance when `waves_per_eu >= 6`. ## Test Plan - test_ck_tile_fmha - tile_example_fmha_fwd: tested this on gfx1100 and gfx1151 ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result - TFLOPs by sequence length target: `gfx1100` layout: `bhsd` - mode: batch / VGPR usage: 225 vs 214 SeqLen | Baseline | Tuned | Gain -- | -- | -- | -- 1024 | 74.10 | 71.97 | 0.97x 4096 | 66.26 | 77.79 | 1.17x 8192 | 68.18 | 75.88 | 1.11x 12288 | 68.47 | 80.44 | 1.17x 16384 | 59.54 | 79.66 | 1.34x 20480 | 55.78 | 77.91 | 1.40x 24576 | 55.08 | 77.47 | 1.41x 27280 | 47.45 | 77.16 | 1.63x - mode: group / VGPR usage: 256 vs 214 SeqLen | Baseline | Tuned | Gain -- | -- | -- | -- 1024 | 71.47 | 70.6 | 0.99x 4096 | 64.74 | 77.06 | 1.19x 8192 | 64.68 | 75.47 | 1.17x 12288 | 66.43 | 79.95 | 1.20x 16384 | 56.02 | 79.73 | 1.42x 20480 | 50.21 | 78.15 | 1.56x 24576 | 47.29 | 77.53 | 1.64x 27280 | 46.13 | 77.04 | 1.67x - TFLOPs by sequence length target: `gfx1151` layout: `bshd` - mode: batch / VGPR usage: 225 vs 223 Batch | Baseline | Tuned | Gain -- | -- | -- | -- 1024 | 26.85 | 29.17 | 1.09x 4096 | 24.75 | 26.01 | 1.05x 8192 | 25.24 | 25.50 | 1.01x 12288 | 25.18 | 25.00 | 0.99x 16384 | 24.79 | 25.91 | 1.05x 20480 | 25.56 | 25.24 | 0.99x 24576 | 25.13 | 26.20 | 1.04x 27280 | 10.78 | 26.35 | 2.44x - mode: group / VGPR usage: 256 vs 229 Batch | Baseline | Tuned | Gain -- | -- | -- | -- 1024 | 27.44 | 26.71 | 0.97x 4096 | 21.89 | 23.09 | 1.05x 8192 | 22.85 | 24.49 | 1.07x 12288 | 24.33 | 24.42 | 1.00x 16384 | 20.05 | 24.98 | 1.24x 20480 | 14.70 | 25.15 | 1.71x 24576 | 11.30 | 26.31 | 2.33x 27280 | 10.10 | 26.32 | 2.61x ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
362bfd72e8 |
[CK] Fix the issue of the aiter to call eightwarps pipeline. (#5218)
## Motivation Fix the failure of the aiter to call eightwarp. Changed Async to the name eightwarps. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan Pass ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
1e8cde4b6d |
[CK_TILE] Add CK Tile bwd weight profiler (#4797)
## Motivation To compare old CK and CK Tile, we need to extend the current CK profiler to support running also CK Tile instance with the same API. In order to have the same instance coverage in CK Tile compared to the old CK, I've added code generation from old CK configurations to CK Tile instances using the CK Builder. ## Technical Details - The codegen python script for CK Tile fwd convs is extended to support also bwd weight and bwd data. - The generated instances are added to the CMake build (target `device_grouped_conv_bwd_weight_tile_instance`s). - A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to the CK Profiler. --------- Co-authored-by: Ville Pietilä <> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> |
||
|
|
4a6638adcf |
[CK_TILE][FMHA] Extend pipelines with pssk for gfx11/12 (#4957)
## Motivation Build pipelines with seqlen padding only to support vectorized loads in the hdim dimension. The existing pipelines have either all dims padded or all dims not padded. These pipelines can be used in ComfyUI for slightly better performance. ## Technical Details Also a fix included for correct FLOPS calculation in `tile_example_fmha_fwd` when `seqlen_q * seqlen_k` overflows index_t capacity (signed int32). ## Test Plan The existing test cases will use the new pipelines when parameters allow (seqlens - padded, hdims - not padded): ``` ninja test_ck_tile_fmha_fwd bin/test_ck_tile_fmha_fwd_fp16 bin/test_ck_tile_fmha_fwd_bf16 bin/test_ck_tile_fmha_fwd_fp8bf16 # for gfx12 ``` ## Test Result All tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
e454358113 |
[CK_TILE] FMHA BWD Launcher Interface (#4577)
## Motivation Reduce memory usage; Be prepared to implement optimizations of reducing nsplits in deterministic cases. ## Technical Details This PR introduces a new launcher interface for the FMHA backward operation, replacing direct function calls with a more structured approach. The launcher encapsulates kernel dispatch logic and provides access to computed metadata like the number of dQ acc splits. **Changes:** - Added `fmha_bwd_launcher` class that wraps kernel execution and exposes `dq_acc_splits` - Moved `fmha_bwd_traits` construction earlier in the execution flow to support launcher initialization - Refactored code generation to produce both legacy API and new launcher constructor ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
5cc04d7676 |
[CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045)
## Motivation Still addressing errors that are blocking the merge of TheRock PR: https://github.com/ROCm/TheRock/actions/runs/22545831304/job/65308264096?pr=3382 ## Technical Details 1. There are multiple fmha python scripts that are writing native paths which are confusing cmake. I addressed one of these in an earlier PR https://github.com/ROCm/rocm-libraries/pull/4812 and now I'm addressing more that are exposed with gfx1200 target: ``` [composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library): [composable_kernel configure] Syntax error in cmake code when parsing string [composable_kernel configure] [composable_kernel configure] B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp [composable_kernel configure] [composable_kernel configure] Invalid character escape '\b'. ``` 2. In the following compiler error we see gemm_prec_str<ADataType, BDataType> being passed as a function to concat(...), instead of being evaluated with the parenthesis operator(), i.e., gemm_prec_str<ADataType, BDataType>(). There are multiples instances of this, I wonder what non-msvc compilers do here: ``` [composable_kernel] FAILED: [code=1] example/ck_tile/38_block_scale_gemm/CMakeFiles/tile_example_gemm_quant.dir/gemm_bquant_quantgrouped_mx_bf16bf8.cpp.obj [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_mx_bf16bf8.cpp:4: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm\run_gemm_quant_example.inc:17: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:7: [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/concat.hpp:119:21: error: implicit conversion between pointer-to-function and pointer-to-object is a Microsoft extension [-Werror,-Wmicrosoft-cast] [composable_kernel] 119 | ((oss << sep << rest), ...); [composable_kernel] | ^~~~ [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/ops/gemm_quant/kernel/gemm_quant_kernel.hpp:248:16: note: in instantiation of function template specialization 'ck_tile::concat<char, char[11], std::basic_string<char> (), std::basic_string<char>>' requested here [composable_kernel] 248 | return concat('_', "gemm_quant", gemm_prec_str<ADataType, BDataType>, GemmPipeline::GetName()); [composable_kernel] | ^ ``` There are plenty of other places where we use gemm_prec_str with the operator(), so I'm pretty sure these were just typos...but I'd like some eyes on it. 3. There are 2 tests that fail to build on Windows, which I've excluded from the build but will open bug tickets for: 1. gemm_weight_preshuffle 2. grouped_gemm_preshuffle Here's a sample of the compiler error for these tests: ``` [composable_kernel] [16/19] Building HIP object test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] FAILED: [code=1] test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] E:\TheRock\build\core\clr\dist\lib\llvm\bin\clang++.exe -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_OCP_FP8 -DCK_USE_WMMA -DCK_USE_WMMA_FP8 -DCK_USE_XDL -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -IE:/TheRock/rocm-libraries/projects/composablekernel/profiler/include -IE:/TheRock/rocm-libraries/projects/composablekernel -IE:/TheRock/rocm-libraries/projects/composablekernel/library/include -IE:/TheRock/rocm-libraries/projects/composablekernel/include -IE:/TheRock/build/ml-libs/composable_kernel/build/include -IE:/TheRock/build/base/half/stage/include -isystem E:/TheRock/build/core/clr/dist/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/getopt-src/src -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx1200 -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -DCK_TILE_USE_OCP_FP8 -MD -MT test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -MF test\ck_tile\grouped_gemm_preshuffle\CMakeFiles\test_ck_tile_grouped_gemm_preshuffle.dir\test_grouped_gemm_preshuffle.cpp.obj.d -o test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -x hip -c E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp:8: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:6: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/check_err.hpp:16: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core.hpp:89: [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core/utility/env.hpp:110:31: warning: 'getenv' is deprecated: This function or variable may be unsafe. Consider using _dupenv_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [-Wdeprecated-declarations] [composable_kernel] 110 | const char* vp = std::getenv(name); [composable_kernel] | ^ [composable_kernel] C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h:1183:20: note: 'getenv' has been explicitly marked deprecated here [composable_kernel] 1183 | _Check_return_ _CRT_INSECURE_DEPRECATE(_dupenv_s) [composable_kernel] | ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:368:55: note: expanded from macro '_CRT_INSECURE_DEPRECATE' [composable_kernel] 368 | #define _CRT_INSECURE_DEPRECATE(_Replacement) _CRT_DEPRECATE_TEXT( \ [composable_kernel] | ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:358:47: note: expanded from macro '_CRT_DEPRECATE_TEXT' [composable_kernel] 358 | #define _CRT_DEPRECATE_TEXT(_Text) __declspec(deprecated(_Text)) [composable_kernel] | ^ [composable_kernel] clang++: error: clang frontend command failed due to signal (use -v to see invocation) [composable_kernel] AMD clang version 22.0.0git (https://github.com/ROCm/llvm-project.git a2dc42b87c63e686377a69f09ea23aec7550babc+PATCHED:e4d5bf498b7b8626bb9716f1f5a5946d45025918) [composable_kernel] Target: x86_64-pc-windows-msvc [composable_kernel] Thread model: posix [composable_kernel] InstalledDir: E:\TheRock\build\core\clr\dist\lib\llvm\bin [composable_kernel] clang++: note: diagnostic msg: Error generating preprocessed source(s). [composable_kernel] ninja: build stopped: subcommand failed. [composable_kernel FAILED WITH CODE 1 in 238 seconds] ninja: build stopped: subcommand failed. ``` ## Test Plan Wait for internal CI and make sure build compiles locally. ## Test Result Waiting on CI ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
2fcb94c18a |
Add Tile Distribution Encoding Register Mapping debug utility for MFMA / WMMA unification work. (#4804)
## Motivation This PR adds a small utility that allows you to use Tile Distribution Encodings to directly map matrix elements to register locations and vice versa. It can also print forward and backward layout mappings similar to the Matrix Calculator utility. The utility is not meant for index calculations in actual kernels, but rather as a debugging tool and probably for automated verification of the policy structs in the new WMMA / MFMA unification design. ## Technical Details Tile Distribution Encodings are a core part of CK Tile which can define the relationship between register and intrinsic matrix fragment elements. They allow for any mapping based on unmerge and merge transformations. Also, they allow for a special "Repeat" dimensions which acts like an additional matrix dimension and allows for replication of certain matrix elements. The new mapping utility can deal with all aspects. ## Test Plan Since this is a debug utility there is nothing to directly test, but there is an example file that defines four different Tile Distribution Encodings and prints their forward and backward mappings, along with some extra parameters. ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
25560c26fb |
[CK] Fix gptoss sink (#4313)
## Motivation This PR removes conditional logic for handling infinity values in the sink mechanism across multiple FMHA pipeline implementations, defaulting sink_size to 0 and adding a constraint in the kernel selection logic. ## Technical Details Changes: Removed __builtin_isinf_sign(sink_v) checks and conditional initialization of LSE accumulators across 7 pipeline files Added default initialization (= 0) for sink_size in 4 argument structs Added F_sink == "f" constraint to kernel compatibility checking ## Test Plan Local test ## Test Result passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> |
||
|
|
1246e65f50 |
[CK TILE] Refactor MX FLATMM example (#4821)
Refactor the MX FLATMM example to support more pipelines across different architectures. This work facilitates the NPI team roadmap. |
||
|
|
20755a0bd2 |
[CK] Add split-K support for ABQuantGrouped in block_scale_gemm (#4816)
## Changes ### Split-K support in `gemm_quant_kernel.hpp` - **`SplitKBatchOffset`**: Added `aq_group_offset` and `aq_k_split_offset` fields (mirroring the existing `bq_*` fields for B) to track each split-K batch's position within the AQ scale tensor. For `ABQuantGrouped`, both offsets are computed from `k_id * KRead` divided by `AQuantGroupSize::kK`. - **`MakeAQBlockWindow`**: Added an `aq_group_offset` parameter (defaulting to 0 for non-split-K paths) so the AQ tensor view's K-group dimension reflects only the remaining K-groups from the split-K offset, consistent with how `MakeBQBlockWindow` handles the BQ tensor. - **`RunGemm`**: Threads the `aq_k_split_offset` through to `MakeAQBlockWindow` when in split-K mode. ### Constraints in `IsSupportedArgument()` Four constraints gate split-K (`k_batch > 1`) for ABQuantGrouped: 1. **Mode check** — split-K is only allowed for `BQuantGrouped` (no preshuffle) or `ABQuantGrouped` (no `APreshuffleQuant`). Any other quant mode with `k_batch > 1` returns `false`. 2. **B quant group alignment** — `KRead` (per-batch K slice) must be divisible by `BQuantGroupSize::kK`. Each batch must operate on complete B quantization groups; a partial group would require splitting a scale value across batches. 3. **A quant group alignment** (new, ABQuantGrouped only) — `KRead` must also be divisible by `AQuantGroupSize::kK` for the same reason applied to the AQ scale tensor. 4. **Minimum 2 K-tile iterations per batch** (new) — The software-pipelined GEMM kernels (CompV3 family) prefetch one tile ahead, so they require `per_batch_num_loop = KRead / KPerBlock >= 2`. When `KRead == KPerBlock` (i.e. each batch is exactly one tile), the prefetch reads into the next batch's memory region and produces incorrect results. Configurations where `K == k_batch * KPerBlock` are therefore rejected. ### Example update (`run_gemm_quant_example.inc`) Updated the comment above the `IsSupportedArgument` call to document that split-K is now supported for both `BQuantGrouped` (no preshuffle) and `ABQuantGrouped` (no `APreshuffleQuant`). ## Unit Tests Two new test files covering decode and prefill tile shapes across a range of `k_batch` values (2–8), data types (FP8, BF8), and quantization group sizes (1×1×128 and 1×128×128 for B): - `test_gemm_quant_abquant_splitk_decode.cpp` — uses the decode tile shape (M=16, N=64, K_tile=256) - `test_gemm_quant_abquant_splitk_prefill.cpp` — uses the prefill tile shape (M=128, N=128, K_tile=128) Each test calls `run_test_with_validation` which runs the kernel and checks correctness against a CPU reference. Configurations excluded from tests are annotated with comments explaining which constraint they violate (typically the `per_batch_num_loop >= 2` requirement). ## Prerequisites This PR depends on #4429, which must be merged before this can be merged. --------- Co-authored-by: Erwin Terpstra <erwin.terpstra@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> |