mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-05 14:11:29 +00:00
[rocm-libraries] ROCm/rocm-libraries#5991 (commit 8d85e8e)
[CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR misallocation (#5991) ## Motivation After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard around the `num_total_loop <= 0` early-exit check, the IGLP pipeline (`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV gradients for non-masking kernels (even with fix in #5915). Assembly inspection confirms that the CFG change causes the LLVM register allocator to reuse AGPR accumulators as scratch destinations in the dK/dV reduction loop, breaking the loop-carried accumulation across Q-tile iterations. ## Technical Details - Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in `BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing: it restores the CFG shape that the register allocator needs to correctly assign dedicated AGPRs to each column of the dK/dV accumulator. - Only the IGLP pipeline is affected; the other two BWD pipelines do not exhibit this issue. ## Test Plan ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
This commit is contained in:
committed by
assistant-librarian[bot]
parent
a33b5be1b9
commit
791afc6465
@@ -163,7 +163,9 @@ struct BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP
|
||||
amd_wave_read_first_lane(integer_divide_ceil(seqlen_q_end - seqlen_q_start, kM0));
|
||||
|
||||
// check early exit if no work to do.
|
||||
if(num_total_loop <= 0)
|
||||
// __builtin_expect is load-bearing: omitting it causes incorrect AGPR allocation in
|
||||
// the dK/dV accumulation loop on some compiler versions, leading to wrong results.
|
||||
if(__builtin_expect(num_total_loop <= 0, 0))
|
||||
{
|
||||
// Note: here dk_acc&dv_acc are all cleared, return it
|
||||
return make_tuple(dk_acc, dv_acc);
|
||||
|
||||
Reference in New Issue
Block a user