mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-29 11:16:59 +00:00
Two prefill_d128 changes on the unified-attention pingpong (checkpoint):
1. refresh_{k,v}_offsets: dedup the per-issue page-table lookup. With a
compile-time page_size the issue->page map is a pure compile-time
function in two provable regimes (page-divides-tile / tile-divides-
page), so phys_page is resolved once per distinct page instead of
once per issue -- collapsing to a single ds_read + readfirstlane at
page_size >= kPageBlockSize. Gated on kHasCePageSize; the runtime-
page-size scalar-promote and per-lane fallbacks stay byte-identical.
Measured fp8 prefill (ps=64), amir-shape sweep: +6.8% aggregate
(5-7%/shape, scaling with seqlen); B2 K-mem barrier straggler
-21..25%, total mean barrier stall -12%. Correctness verified
fp8 ps={32,64} and bf16 ps={16,32,64}.
(A cross-tile phys_page memo was prototyped and reverted: the Tier-2
LDS read is already cheap/hidden post-dedup, so the runtime guard +
loop-carried dep it needed was a net ~0.3% regression.)
2. Fork the FMHA CoreLoopScheduler into a UA-owned UAcoreLoopScheduler
and thread MOVE_FMHA_MASK_TO_COMPUTE through its sched_group_barrier
hints so the per-phase instruction-mix hint stays in lockstep with
mask code motion. With the macro at 0 the table is byte-identical to
the upstream FMHA scheduler (same hints, same codegen).
Co-authored-by: Cursor <cursoragent@cursor.com>