mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-29 19:28:33 +00:00
Under split-KV, a KV token co-owned by two query tiles (which happens only when num_queries_per_kv does not divide kBlockM, e.g. d=128 qpkv=6) was assigned its split partition from the per-tile causal horizon (total_num_kv_blocks, which grows with the query tile index). The two owning tiles then reduced disjoint KV-block ranges for that shared token, and the combine step merged partials over different ranges -> a ~1-row error / NaN on the tile-boundary token. MHA and ratios that divide kBlockM are immune (no token is shared across tiles).

Fix: derive blocks_per_split from the causal-INDEPENDENT full-sequence block count so split s maps to the same blocks in every query tile, then clamp only the END by the per-tile causal horizon. The duplicate co-owned store becomes idempotent again. num_splits == 1 is unchanged.

Also adds the d128 bf16 page_size=128 decode instances (mask/nmask x default/s/t) plus the matching dispatch in unified_attention.cpp and the fmha_batch_prefill codegen hook.

Co-authored-by: Cursor <cursoragent@cursor.com>
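
The partitioning change described above can be illustrated with a minimal host-side sketch. This is not the kernel code: SplitRange, ceil_div, split_range_per_tile and split_range_tile_independent are hypothetical names, and the block counts in the usage note are made-up values chosen only to show the effect. It assumes each split covers a contiguous, half-open range of KV blocks.

```cpp
#include <algorithm>

// Hypothetical helper type (not from the source): half-open KV-block range [begin, end).
struct SplitRange {
    int begin;
    int end;
    bool empty() const { return end <= begin; }
};

inline int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Old behaviour as described in the message: blocks_per_split is derived from
// the per-tile causal horizon, so the same split index can start at a
// different block in different query tiles.
inline SplitRange split_range_per_tile(int split, int num_splits,
                                       int causal_num_kv_blocks) {
    const int blocks_per_split = ceil_div(causal_num_kv_blocks, num_splits);
    const int begin = split * blocks_per_split;
    const int end = std::min(begin + blocks_per_split, causal_num_kv_blocks);
    return {begin, end};
}

// Fixed behaviour as described in the message: blocks_per_split comes from the
// causal-independent full-sequence block count, and only the END is clamped
// by the per-tile causal horizon.
inline SplitRange split_range_tile_independent(int split, int num_splits,
                                               int full_num_kv_blocks,
                                               int causal_num_kv_blocks) {
    const int blocks_per_split = ceil_div(full_num_kv_blocks, num_splits);
    const int begin = split * blocks_per_split;
    const int end = std::min(begin + blocks_per_split, causal_num_kv_blocks);
    return {begin, end};  // may be empty when the split lies past the horizon
}
```

With 4 splits, 16 blocks in the full sequence, and two tiles whose causal horizons are 5 and 9 blocks, the per-tile version starts split 1 at block 2 in one tile and block 3 in the other, while the tile-independent version starts split 1 at block 4 in both and only the clamped end differs. With num_splits == 1 both versions return the whole causal range.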