[rocm-libraries] ROCm/rocm-libraries#5991 (commit 8d85e8e)

[CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR
 misallocation (#5991)

## Motivation

After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard
around the
`num_total_loop <= 0` early-exit check, the IGLP pipeline
(`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV
gradients for
non-masking kernels (even with fix in #5915). Assembly inspection
confirms that the CFG change causes the LLVM
register allocator to reuse AGPR accumulators as scratch destinations in
the dK/dV
reduction loop, breaking the loop-carried accumulation across Q-tile
iterations.

## Technical Details

- Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in
`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing:
it
restores the CFG shape that the register allocator needs to correctly
assign
  dedicated AGPRs to each column of the dK/dV accumulator.
- Only the IGLP pipeline is affected; the other two BWD pipelines do not
exhibit
  this issue.

## Test Plan

## Test Result

## Submission Checklist

- [x] Look over the contributing guidelines at

https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
This commit is contained in:
Yi DING
2026-04-01 05:45:19 +00:00
committed by assistant-librarian[bot]
parent a33b5be1b9
commit 791afc6465

View File

@@ -163,7 +163,9 @@ struct BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP
amd_wave_read_first_lane(integer_divide_ceil(seqlen_q_end - seqlen_q_start, kM0));
// check early exit if no work to do.
if(num_total_loop <= 0)
// __builtin_expect is load-bearing: omitting it causes incorrect AGPR allocation in
// the dK/dV accumulation loop on some compiler versions, leading to wrong results.
if(__builtin_expect(num_total_loop <= 0, 0))
{
// Note: here dk_acc&dv_acc are all cleared, return it
return make_tuple(dk_acc, dv_acc);