composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-03 13:11:25 +00:00

Files

Hosang Yoon 4c0e73ab12 [rocm-libraries] ROCm/rocm-libraries#6156 (commit 367565a)

[CK_TILE] Optimize FMHA head-dim padded path on gfx11/gfx12
 (#6156)

## Motivation
On gfx11/gfx12, FMHA forward kernels that require head-dim padding show
a large performance drop compared to the exact-head-dim path. In
practice, padded cases such as `HDIM=72` and `HDIM=80` were falling too
far off the fast path.

This PR improves padded-head-dim FMHA performance on gfx11/gfx12 while
keeping the behavior for other GPUs unchanged.

## Technical Details

- Add/scope a dedicated padded-head-dim (`qr_hpad`) FMHA forward path
for gfx11/gfx12.
- For `receipt=0`, keep support conservative and only enable the padded
fast path for vector-safe cases (`head_dim % 8 == 0`), matching the
existing assumption used on other GPUs.
- Move `v_prefetch` later only for the head-dim-padded path on
gfx11/gfx12. This reduces live ranges and removes the register-spill
behavior seen in the earlier scheduling.
- Enable the buffer-load OOB check offset trick for the padded path on
gfx11/gfx12.

## Test Plan

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

## Test Result

Observed padded-head-dim performance improvements for HDIM=72/80:

- gfx11: about ~3.5x
- gfx1151: about ~2.0x
- gfx12: about ~1.3x

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

2026-04-08 14:53:18 +00:00

__init__.py

chore(copyright): update copyright header for example directory (#3273 )

2025-11-24 18:02:41 -08:00

fmha_batch_prefill.py

[rocm-libraries] ROCm/rocm-libraries#5918 (commit a7e2c67)

2026-03-30 10:21:33 +00:00

fmha_bwd.py

[rocm-libraries] ROCm/rocm-libraries#5504 (commit 47f86c7)

2026-04-02 03:17:45 +00:00

fmha_fwd_appendkv.py

[rocm-libraries] ROCm/rocm-libraries#4812 (commit bb5a4dd)