Files
composable_kernel/include
juuso-oskari 7fc24c8c45 CK-UA: within-tile page-table dedup + UA-owned core-loop scheduler
Two prefill_d128 changes on the unified-attention pingpong (checkpoint):

1. refresh_{k,v}_offsets: dedup the per-issue page-table lookup. With a
   compile-time page_size the issue->page map is a pure compile-time
   function in two provable regimes (page-divides-tile / tile-divides-
   page), so phys_page is resolved once per distinct page instead of
   once per issue -- collapsing to a single ds_read + readfirstlane at
   page_size >= kPageBlockSize. Gated on kHasCePageSize; the runtime-
   page-size scalar-promote and per-lane fallbacks stay byte-identical.
   Measured fp8 prefill (ps=64), amir-shape sweep: +6.8% aggregate
   (5-7%/shape, scaling with seqlen); B2 K-mem barrier straggler
   -21..25%, total mean barrier stall -12%. Correctness verified
   fp8 ps={32,64} and bf16 ps={16,32,64}.

   (A cross-tile phys_page memo was prototyped and reverted: the Tier-2
   LDS read is already cheap/hidden post-dedup, so the runtime guard +
   loop-carried dep it needed was a net ~0.3% regression.)

2. Fork the FMHA CoreLoopScheduler into a UA-owned UAcoreLoopScheduler
   and thread MOVE_FMHA_MASK_TO_COMPUTE through its sched_group_barrier
   hints so the per-phase instruction-mix hint stays in lockstep with
   mask code motion. With the macro at 0 the table is byte-identical to
   the upstream FMHA scheduler (same hints, same codegen).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-01 09:01:21 +00:00
..