[rocm-libraries] ROCm/rocm-libraries#5991 (commit 8d85e8e)

[CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR misallocation (#5991) ## Motivation After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard around the `num_total_loop <= 0` early-exit check, the IGLP pipeline (`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV gradients for non-masking kernels (even with fix in #5915). Assembly inspection confirms that the CFG change causes the LLVM register allocator to reuse AGPR accumulators as scratch destinations in the dK/dV reduction loop, breaking the loop-carried accumulation across Q-tile iterations. ## Technical Details - Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in `BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing: it restores the CFG shape that the register allocator needs to correctly assign dedicated AGPRs to each column of the dK/dV accumulator. - Only the IGLP pipeline is affected; the other two BWD pipelines do not exhibit this issue. ## Test Plan ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-05 14:11:29 +00:00 · 2026-04-01 05:45:19 +00:00
parent a33b5be1b9
commit 791afc6465
1 changed files with 3 additions and 1 deletions
--- a/include/ck_tile/ops/fmha/pipeline/block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp.hpp
+++ b/include/ck_tile/ops/fmha/pipeline/block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp.hpp
@@ -163,7 +163,9 @@ struct BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP
            amd_wave_read_first_lane(integer_divide_ceil(seqlen_q_end - seqlen_q_start, kM0));

        // check early exit if no work to do.
-        if(num_total_loop <= 0)
+        // __builtin_expect is load-bearing: omitting it causes incorrect AGPR allocation in
+        // the dK/dV accumulation loop on some compiler versions, leading to wrong results.
+        if(__builtin_expect(num_total_loop <= 0, 0))
        {
            // Note: here dk_acc&dv_acc are all cleared, return it
            return make_tuple(dk_acc, dv_acc);