Files
composable_kernel/example
juuso-oskari 80009b4c82 CK-UA: paged ps128 fast path for fp8 prefill_d128 at contiguous parity
Add single-page (page_blk_size == kPageBlockSize == 128) paged instances for
fp8 prefill_d128 (nmask + mask) and route page_blk_size==128 to them, so the
canonical prefill shape exercises the kRebaseKSrd single-page addressing fast
path (1 readfirstlane + 1 LDS block-table read + SRD rebase per tile).

Two address-overhead optimizations bring the paged kernel to parity with (and
slightly above) the contiguous baseline on b1 sq75600 hq=hk=5 d128 fp8:

- Share the K/V per-lane scatter array. In the single-page rebase regime
  k_page_offsets == v_page_offsets bit-for-bit (same kv_cache strides, same
  fp8 DRAM distribution), so feed one loop-invariant array to both scatter-
  gather windows; the backend then coalesces the duplicated page_idx_ storage
  (nmask spills 5->1, mask 6->2).

- Cross-stagger phys_page carry. K runs one tile ahead of V and shares the
  logical->physical page map, so V reuses the phys_page K already broadcast
  (R=2 tile-parity ring in SGPRs) instead of issuing its own block-table LDS
  read + readfirstlane. WG1 addr-phase stall drops 940 -> 64 cyc.

Standalone (gfx950): paged 1810 -> ~1910 TFLOP/s noncausal (matches contiguous
~1905); causal 1463 -> 1592. 0% mismatch vs host reference. Contiguous path is
unchanged -- all new code is gated on the paged single-page rebase flags.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-15 12:53:50 +00:00
..
2026-01-14 07:31:45 -08:00