composable_kernel/include
Ding, Yi b3a5e7ff64 [CK_TILE] Fix dq_acc per-nhead stride in FMHA BWD group mode
In group mode the dq_acc workspace layout uses physical (padded)
seqlen_q for the per-nhead stride (see FmhaBwdWorkspaceManager doc;
also matches FmhaBwdConvertQGradKernel reads). The unified-workspace
refactor inlined this stride as kargs.seqlen_q, which is the LOGICAL
length when seqlen_q_ptr is provided. As a result, for every nhead
index > 0 in each batch, the main kernel writes dq_acc at offsets that
the convert kernel never reads, so dQ ends up zero for those positions.

Hoist physical_seqlen_q to the outer scope and use it for both the
non-deterministic and deterministic stride computations in the
dq_dram_window lambda. Batch mode is unaffected since kargs.seqlen_q
already equals the physical length there.
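
The mismatch described above can be illustrated with a small host-side
sketch. It is not the CK kernel code: the helper names dq_acc_offset and
rows_recovered, and the simplified [nhead, physical_seqlen_q, hdim]
layout for one batch, are assumptions made only for illustration. The
writer indexes with a caller-chosen per-nhead stride, and the reader
always uses the physical (padded) length, as the convert kernel is said
to do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a CK API): element offset of (head, row) inside
// one batch's dq_acc slab, assuming a [nhead, seqlen_stride, hdim] layout.
constexpr std::size_t dq_acc_offset(std::size_t head,
                                    std::size_t row,
                                    std::size_t seqlen_stride,
                                    std::size_t hdim)
{
    return (head * seqlen_stride + row) * hdim;
}

// Hypothetical check: the "main kernel" writes every logical row using
// writer_stride as the per-nhead stride; the "convert kernel" reads using
// the physical stride. Returns how many rows the reader sees as written.
inline std::size_t rows_recovered(std::size_t nhead,
                                  std::size_t logical_seqlen_q,
                                  std::size_t physical_seqlen_q,
                                  std::size_t hdim,
                                  std::size_t writer_stride)
{
    // Keep all writes inside the slab allocated with the physical length.
    assert(logical_seqlen_q <= writer_stride);
    assert(writer_stride <= physical_seqlen_q);

    std::vector<float> dq_acc(nhead * physical_seqlen_q * hdim, 0.0f);

    // Writer side: mark every element of every logical row.
    for(std::size_t h = 0; h < nhead; ++h)
        for(std::size_t r = 0; r < logical_seqlen_q; ++r)
            for(std::size_t d = 0; d < hdim; ++d)
                dq_acc[dq_acc_offset(h, r, writer_stride, hdim) + d] = 1.0f;

    // Reader side: always strides by the physical (padded) length.
    std::size_t recovered = 0;
    for(std::size_t h = 0; h < nhead; ++h)
        for(std::size_t r = 0; r < logical_seqlen_q; ++r)
            if(dq_acc[dq_acc_offset(h, r, physical_seqlen_q, hdim)] != 0.0f)
                ++recovered;
    return recovered;
}
```

With 2 heads, a logical length of 2 and a physical length of 5, passing
the physical length as writer_stride lets the reader find all 4 rows,
while passing the logical length leaves the rows of head 1 at offsets
the reader never visits.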

Fixes 135 padding-related failures in test_ck_tile_fmha_bwd_fp16
(BasicQPadding / MultiBatchPadding / PaddingWithMask / QKVPadding /
VariedPaddingRatios / ZeroLengthPadding / Deterministic /
ElementwiseBias). Verified locally: full suite 672 PASSED / 0 FAILED.
SGPR usage drops by 1; VGPR/AGPR/spill/occupancy unchanged.
2026-04-27 01:54:53 -05:00