composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Files

Ding, Yi b3a5e7ff64 [CK_TILE] Fix dq_acc per-nhead stride in FMHA BWD group mode

In group mode the dq_acc workspace layout uses physical (padded)
seqlen_q for the per-nhead stride (see FmhaBwdWorkspaceManager doc;
also matches FmhaBwdConvertQGradKernel reads). The unified-workspace
refactor inlined this stride as kargs.seqlen_q, which is the LOGICAL
length when seqlen_q_ptr is provided. As a result, for every batch the
main kernel writes dq_acc for nhead > 0 at offsets that the convert
kernel never reads, so dQ ends up zero for those positions.

Hoist physical_seqlen_q to the outer scope and use it for both the
non-deterministic and deterministic stride computations in the
dq_dram_window lambda. Batch mode is unaffected since kargs.seqlen_q
already equals the physical length there.
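
The stride mismatch can be illustrated with a toy model. This is only a
sketch, not the kernel code: it assumes each batch's dq_acc slice is laid
out as [nhead][physical_seqlen_q][hdim_q], and BatchShape,
nhead_stride_physical, nhead_stride_logical and head_offset are
hypothetical names invented for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified description of one batch in group mode.
// Assumed layout of its dq_acc slice: [nhead][physical_seqlen_q][hdim_q].
struct BatchShape {
    std::size_t logical_seqlen_q;   // unpadded length (as read via seqlen_q_ptr)
    std::size_t physical_seqlen_q;  // padded length reserved in the workspace
    std::size_t hdim_q;             // head dimension
};

// Per-nhead stride the reader (convert step) expects: physical length.
constexpr std::size_t nhead_stride_physical(const BatchShape& s) {
    return s.physical_seqlen_q * s.hdim_q;
}

// Per-nhead stride the writer used before the fix: logical length.
constexpr std::size_t nhead_stride_logical(const BatchShape& s) {
    return s.logical_seqlen_q * s.hdim_q;
}

// Element offset of the first element of head `nhead` for a given stride.
constexpr std::size_t head_offset(std::size_t nhead, std::size_t stride) {
    return nhead * stride;
}

// Walks through one padded batch: logical length 5, physical length 8.
inline void demo() {
    const BatchShape padded{5, 8, 4};
    // Head 0 starts at offset 0 under either stride, so it is unaffected.
    assert(head_offset(0, nhead_stride_logical(padded)) ==
           head_offset(0, nhead_stride_physical(padded)));
    // Head 1: the writer lands at 20, the reader looks at 32.
    assert(head_offset(1, nhead_stride_logical(padded)) == 20);
    assert(head_offset(1, nhead_stride_physical(padded)) == 32);
}
```

With no padding the two lengths coincide and both strides agree, which is
consistent with batch mode being unaffected.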

Fixes 135 padding-related failures in test_ck_tile_fmha_bwd_fp16
(BasicQPadding / MultiBatchPadding / PaddingWithMask / QKVPadding /
VariedPaddingRatios / ZeroLengthPadding / Deterministic /
ElementwiseBias). Verified locally: full suite 672 PASSED / 0 FAILED.
SGPR usage drops by 1; VGPR/AGPR/spill/occupancy unchanged.

2026-04-27 01:54:53 -05:00

[CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550)

2026-04-22 15:47:47 +00:00

ck_tile

[CK_TILE] Fix dq_acc per-nhead stride in FMHA BWD group mode

2026-04-27 01:54:53 -05:00

rapidjson

Update pre-commit to fixed versions, run remod for ck_tile (#2895)

2025-10-16 15:29:17 -07:00