mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-14 02:02:46 +00:00
In group mode, the dq_acc workspace layout uses the physical (padded) seqlen_q for the per-nhead stride (see the FmhaBwdWorkspaceManager doc; this also matches what FmhaBwdConvertQGradKernel reads). The unified-workspace refactor inlined this stride as kargs.seqlen_q, which is the LOGICAL length when seqlen_q_ptr is provided. As a result, for batch i and nhead > 0 the main kernel writes dq_acc at offsets that the convert kernel never reads, so dQ ends up zero for those positions.

Hoist physical_seqlen_q to the outer scope and use it for both the non-deterministic and deterministic stride computations in the dq_dram_window lambda. Batch mode is unaffected, since kargs.seqlen_q already equals the physical length there.

Fixes 135 padding-related failures in test_ck_tile_fmha_bwd_fp16 (BasicQPadding / MultiBatchPadding / PaddingWithMask / QKVPadding / VariedPaddingRatios / ZeroLengthPadding / Deterministic / ElementwiseBias).

Verified locally: full suite 672 PASSED / 0 FAILED. SGPR usage drops by 1; VGPR/AGPR/spill/occupancy unchanged.
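
The stride mismatch described above can be illustrated with a minimal host-side sketch. It is not the kernel code: the struct, the function names, and the assumed per-batch layout [nhead][physical_seqlen_q][hdim_q] are hypothetical simplifications chosen only to show why a logical-length stride diverges from a physical-length stride once nhead > 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of one batch's dq_acc region in group mode.
// ASSUMPTION: the region is laid out as [nhead][physical_seqlen_q][hdim_q],
// row-major. The real workspace layout may carry additional dimensions.
struct DqAccBatchShape
{
    std::int64_t logical_seqlen_q;  // number of valid (unpadded) query rows
    std::int64_t physical_seqlen_q; // number of rows including padding
    std::int64_t hdim_q;            // elements per query row
};

// Per-nhead stride that the reader side is described as using (physical length).
constexpr std::int64_t nhead_stride_physical(const DqAccBatchShape& s)
{
    return s.physical_seqlen_q * s.hdim_q;
}

// Per-nhead stride the writer used before the fix (logical length).
constexpr std::int64_t nhead_stride_logical(const DqAccBatchShape& s)
{
    return s.logical_seqlen_q * s.hdim_q;
}

// Element offset of the first row of head `nhead_idx` within the batch region.
constexpr std::int64_t head_offset(std::int64_t nhead_idx, std::int64_t nhead_stride)
{
    return nhead_idx * nhead_stride;
}

// True when writer (logical stride) and reader (physical stride) would
// address the same location for the given head.
constexpr bool offsets_agree(const DqAccBatchShape& s, std::int64_t nhead_idx)
{
    return head_offset(nhead_idx, nhead_stride_logical(s)) ==
           head_offset(nhead_idx, nhead_stride_physical(s));
}
```

With a padded batch such as {logical 96, physical 128, hdim 64}, head 0 lands at offset 0 under either stride, but head 1 lands at 6144 with the logical stride and at 8192 with the physical one, which is the disagreement the fix removes. When logical and physical lengths are equal, as in batch mode, the two strides coincide for every head.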