Linjun-AMD 5c7b7ec3f1 [rocm-libraries] ROCm/rocm-libraries#7272 (commit d02f3c0)
[ck_tile][fmha_bwd] Fix sink_host OOB in group mode reference runner (#7272)

## Summary

In `fmha_bwd_runner.hpp`, the `sink_host` `HostTensor` is allocated with
first dimension `shape_batch` (= 1 in group mode), but the reference
forward loop accesses `sink_host(wb, i_h)` with `wb ∈ [0, batch-1]`. For
any `wb >= 1` this is an out-of-bounds heap read. It silently corrupts
the reference forward math chain (`lse_host`, `o_host`) and turns the
bwd-side `d_sink_head_acc` reference into non-deterministic garbage.
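The shape mismatch can be illustrated with a small sketch. `Tensor2D` and `count_oob` below are hypothetical stand-ins written for this note, not `ck_tile::HostTensor` or any code from the runner; they assume a row-major 2-D layout and only compare indices against the allocated extents, so nothing is actually read out of bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 2-D host tensor (NOT ck_tile::HostTensor).
struct Tensor2D {
    std::size_t d0, d1;        // allocated extents
    std::vector<float> data;   // d0 * d1 elements, row-major (assumed layout)

    Tensor2D(std::size_t n0, std::size_t n1) : d0(n0), d1(n1), data(n0 * n1, 0.0f) {}

    // Flat offset an unchecked accessor would use; computed, never dereferenced.
    std::size_t offset(std::size_t i0, std::size_t i1) const { return i0 * d1 + i1; }

    bool in_bounds(std::size_t i0, std::size_t i1) const { return i0 < d0 && i1 < d1; }
};

// Count the (wb, i_h) pairs a loop over wb < batch, i_h < nhead would
// touch outside the allocation.
inline std::size_t count_oob(const Tensor2D& t, std::size_t batch, std::size_t nhead) {
    std::size_t n = 0;
    for (std::size_t wb = 0; wb < batch; ++wb)
        for (std::size_t i_h = 0; i_h < nhead; ++i_h)
            if (!t.in_bounds(wb, i_h)) ++n;
    return n;
}
```

With `batch = 2` and `nhead = 2`, a tensor allocated as `Tensor2D(1, 2)` (the `shape_batch` case) has two out-of-range pairs, both with `wb == 1`, while `Tensor2D(2, 2)` has none.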

`HostTensor::operator()` does not bounds check, so the OOB is not caught
at runtime. This manifests as intermittent `tile_example_fmha_bwd`
failures (25–67% fail rate) when `-sink_grad=1` is combined with
`-mode=1` (group mode), with bit-exact but spurious `max_err` values
like 4.27 / 14.6.
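For contrast, here is a minimal sketch of what a bounds-checked accessor would report for the same access pattern. `CheckedTensor2D` and `reference_loop_is_safe` are hypothetical names introduced for illustration only; this is not how `HostTensor` is implemented, and the PR does not add such a check.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical checked 2-D host buffer; illustration only, not ck_tile code.
class CheckedTensor2D {
public:
    CheckedTensor2D(std::size_t d0, std::size_t d1)
        : d0_(d0), d1_(d1), data_(d0 * d1, 0.0f) {}

    // Validates both indices before touching memory.
    float& at(std::size_t i0, std::size_t i1) {
        if (i0 >= d0_ || i1 >= d1_) {
            throw std::out_of_range("index (" + std::to_string(i0) + ", " +
                                    std::to_string(i1) + ") outside extents");
        }
        return data_[i0 * d1_ + i1];
    }

private:
    std::size_t d0_, d1_;
    std::vector<float> data_;
};

// Returns true if every (wb, i_h) with wb < batch, i_h < nhead is readable.
inline bool reference_loop_is_safe(CheckedTensor2D& sink, std::size_t batch,
                                   std::size_t nhead) {
    try {
        for (std::size_t wb = 0; wb < batch; ++wb)
            for (std::size_t i_h = 0; i_h < nhead; ++i_h)
                (void)sink.at(wb, i_h);
    } catch (const std::out_of_range&) {
        return false;
    }
    return true;
}
```

Under these assumptions, a buffer sized `(1, nhead)` fails on the first access with `wb == 1`, whereas one sized `(batch, nhead)` passes. An unchecked accessor would instead read whatever happens to follow the allocation on the heap, which fits the intermittent failures described above.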

## Fix

One-line: allocate `sink_host` with `batch` (the real per-batch dim)
instead of `shape_batch`, mirroring how `sink_host` is accessed by the
loop.

```diff
-    sink_grad ? std::array<ck_tile::index_t, 2>{shape_batch, nhead}
+    sink_grad ? std::array<ck_tile::index_t, 2>{batch, nhead}
```

## Repro

```
tile_example_fmha_bwd -b=2 -h=2 -s=516 -s_k=253 -prec=bf16 -d=72 \
  -bias=n -dbias=0 -p_drop=0 -iperm=1 -operm=1 -deterministic=0 \
  -v=3 -mode=1 -kname=1 -sink_grad=1
```

## Verification

- 0/30 fail on the repro config after fix
- Baselines (before fix):
  - sink=1, mask=n: 25% fail rate (p ≈ 1.8e-4)
  - sink=1, mask=t: 67% fail rate (p ≈ 6e-15)

## Attribution

Shape bug introduced together with `sink_grad` in #5504. Unrelated to
#6914 (which is a fwd-only fix on a different code path).
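The message does not say how the p values quoted next to the baseline fail rates were obtained. One plausible reading, stated here as an assumption, is that each is the probability of seeing 30 consecutive passes if the pre-fix fail rate still applied and runs were independent. A minimal sketch of that arithmetic, with `prob_all_pass` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Probability that `runs` independent runs all pass when each run fails
// with probability `fail_rate`. The independence assumption is ours, not
// something the commit message states.
inline double prob_all_pass(double fail_rate, int runs) {
    return std::pow(1.0 - fail_rate, runs);
}
```

Under this reading, `prob_all_pass(0.25, 30)` is about 1.8e-4, matching the first quoted value. `prob_all_pass(0.67, 30)` is a few times 1e-15, the same order of magnitude as the second; the exact figure depends on the unrounded fail rate.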

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-13 08:49:13 +00:00