From 9b8b2456b4b17c7ad2d4f25f869e197ea6ccd15d Mon Sep 17 00:00:00 2001 From: Yi DING Date: Wed, 1 Apr 2026 13:44:04 +0800 Subject: [PATCH] [CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR misallocation (#5991) ## Motivation After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard around the `num_total_loop <= 0` early-exit check, the IGLP pipeline (`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV gradients for non-masking kernels (even with fix in #5915). Assembly inspection confirms that the CFG change causes the LLVM register allocator to reuse AGPR accumulators as scratch destinations in the dK/dV reduction loop, breaking the loop-carried accumulation across Q-tile iterations. ## Technical Details - Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in `BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing: it restores the CFG shape that the register allocator needs to correctly assign dedicated AGPRs to each column of the dK/dV accumulator. - Only the IGLP pipeline is affected; the other two BWD pipelines do not exhibit this issue. ## Test Plan ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --- .../block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp.hpp | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/include/ck_tile/ops/fmha/pipeline/block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp.hpp b/include/ck_tile/ops/fmha/pipeline/block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp.hpp index 97db0f95c4..34ba8d6c47 100644 --- a/include/ck_tile/ops/fmha/pipeline/block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp.hpp +++ b/include/ck_tile/ops/fmha/pipeline/block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp.hpp @@ -163,7 +163,9 @@ struct BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP amd_wave_read_first_lane(integer_divide_ceil(seqlen_q_end - seqlen_q_start, kM0)); // check early exit if no work to do. - if(num_total_loop <= 0) + // __builtin_expect is load-bearing: omitting it causes incorrect AGPR allocation in + // the dK/dV accumulation loop on some compiler versions, leading to wrong results. + if(__builtin_expect(num_total_loop <= 0, 0)) { // Note: here dk_acc&dv_acc are all cleared, return it return make_tuple(dk_acc, dv_acc);