Files
composable_kernel/example
kensclin fc2862d712 [rocm-libraries] ROCm/rocm-libraries#6846 (commit 377def4)
[CK_TILE] Add fmha forward hdim 256 support (#6846)

## Motivation

Enable Composable Kernel FMHA forward kernel for **hdim=256 BF16** on
AMD gfx950 (MI350X). Prior to this change the (256, 256) head-dim
configuration either failed to compile, was filtered out by the
compatibility rules, or produced incorrect kernel output due to an LDS
layout accounting bug.

## Technical Details

  Four files changed, all to enable hdim=256 BF16 on gfx950.

- **`fmha_fwd.py`** — Allow `(256, 256)` in gfx950 compatibility rule;
set `(256,256)` BF16 tile to `M0=128, N0=64` (the LDS-feasible shape on
gfx950); emit minimal valid instance set for d=256 to bound compile
time.

- **`fmha_fwd_kernel.hpp`** — Gate Prefill launch path off for d=256
(`PrefillCase = kM0 > 64 && kQKHeaddim < 256`); the double-buffer
Prefill variant overflows the 160 KB LDS budget.

- **`trload_policy.hpp`** — **Critical correctness fix**: the LDS layout
accounting in `GetSmemSize` was wrong (`max(Q, K+S+V)` instead of
`max(Q, K) + V + S`), under-allocating LDS and silently corrupting d=256
output (~2% wrong values).

- **`trload.hpp`** — Thread `LoadOnce=true` through all d=256 K-LDS
descriptors so the compiler picks the matching XOR swizzle period;
recompute the S-tile LDS offset to match the corrected `GetSmemSize`
formula.

## Test Plan

Built and ran `tile_example_fmha_fwd` on gfx950 (MI350X) with the
canonical d=256 BF16 configurations:

  ```bash
  cd build && ninja tile_example_fmha_fwd
./bin/tile_example_fmha_fwd -prec=bf16 -d=256 -d_v=256 -b=1 -h=32 -h_k=2
-s=1024 -s_k=1024 -bias=n -mask=t -lse=0 -p_drop=0 -warmup=3 -repeat=10
-kname=1 -v=1
./bin/tile_example_fmha_fwd -prec=bf16 -d=256 -d_v=256 -b=8 -h=32 -h_k=2
-s=16384 -s_k=16384 -bias=n -mask=t -lse=0 -p_drop=0 -warmup=3
-repeat=10 -kname=1 -v=1
  ```

## Test Result

  ```bash
-b=1 -s=1024
[bf16|batch|bhsd] b:1, h:32/2, s:1024/1024, d:256/256, scale_s:0.0625,
bias:n, p_drop:0, lse:0, qscale:n, mask:t(-1:0), v:r,
fmha_fwd_d256_bf16_batch_b128x64x32x256x32x256_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_trload_vr_psddv_nlogits_nbias_mc_nlse_ndropout_nskip_nqscale_ntrload_nsink,
0.058 ms, 298.42 TFlops, 618.68 GB/s, valid:y

-b=4 -s=16384
[bf16|batch|bhsd] b:8, h:32/2, s:16384/16384, d:256/256, scale_s:0.0625,
bias:n, p_drop:0, lse:0, qscale:n, mask:t(-1:0), v:r,
fmha_fwd_d256_bf16_batch_b128x64x32x256x32x256_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_trload_vr_psddv_nlogits_nbias_mc_nlse_ndropout_nskip_nqscale_ntrload_nsink,
42.797 ms, 822.18 TFlops, 106.63 GB/s, valid:y
  ```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: poyenc <1132573+poyenc@users.noreply.github.com>
2026-05-22 01:57:41 +08:00
..