Add single-page (page_blk_size == kPageBlockSize == 128) paged instances for
fp8 prefill_d128 (nmask + mask) and route page_blk_size==128 to them, so the
canonical prefill shape exercises the kRebaseKSrd single-page addressing fast
path (1 readfirstlane + 1 LDS block-table read + SRD rebase per tile).
Two address-overhead optimizations bring the paged kernel to parity with (and
slightly above) the contiguous baseline on b1 sq75600 hq=hk=5 d128 fp8:
- Share the K/V per-lane scatter array. In the single-page rebase regime
k_page_offsets == v_page_offsets bit-for-bit (same kv_cache strides, same
fp8 DRAM distribution), so feed one loop-invariant array to both scatter-
gather windows; the backend then coalesces the duplicated page_idx_ storage
(nmask spills 5->1, mask 6->2).
- Cross-stagger phys_page carry. K runs one tile ahead of V and shares the
logical->physical page map, so V reuses the phys_page K already broadcast
(R=2 tile-parity ring in SGPRs) instead of issuing its own block-table LDS
read + readfirstlane. WG1 addr-phase stall drops 940 -> 64 cyc.
Standalone (gfx950): paged 1810 -> ~1910 TFLOP/s noncausal (matches contiguous
~1905); causal 1463 -> 1592. 0% mismatch vs host reference. Contiguous path is
unchanged -- all new code is gated on the paged single-page rebase flags.
Co-authored-by: Cursor <cursoragent@cursor.com>