Adds compile-time levers used to characterise why prefill_d128 fp8 fits a KV
tile of 64 but not 128 under the 256-VGPR/wave ceiling (see
ua-test-scripts/kv128_vgpr_findings.md). Every lever is guarded, and the build
is bit-identical to production when they are unset:
- UA_PREFILL_D128_BLOCKSIZE (default 64): KV-tile override for probing kv128.
- UA_FA4_INPLACE_DELTA (default 0): drop sp_delta and apply the scale-shift/exp2
in place on sp_compute. fmha_alu_D_upd reads only m/l/o_acc/rowsum_p, never the
raw scores, so the result is bit-identical. VGPR-neutral on its own (the
compiler already reclaims sp_delta).
- UA_FA4_SHARED_SPCOMPUTE (default 0): keep ONE shared fp32 sp_compute plus a
2-slot fp8 P ping-pong instead of a 2-slot union{sp_compute,p}. The deferred
PV only needs one live fp32 score; this cuts kv128 spills from 173 to 126.
(Forces in-place delta; slightly regresses kv64, so it is a kv128-only lever.)
- UA_FA4_UNION_KV (default 0): union k_tile/v_tile (ASM-style). VGPR-neutral;
kept as a documented dead end (the compiler already overlaps their live ranges).
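
A minimal sketch of how guarded levers of this kind are commonly wired up,
assuming plain #ifndef defaults. The macro names and default values are the
ones listed above; the PrefillD128Levers struct and its member names are
hypothetical and are not taken from the kernel source:

```cpp
// Each lever falls back to its production default when not set with -D.
#ifndef UA_PREFILL_D128_BLOCKSIZE
#define UA_PREFILL_D128_BLOCKSIZE 64
#endif
#ifndef UA_FA4_INPLACE_DELTA
#define UA_FA4_INPLACE_DELTA 0
#endif
#ifndef UA_FA4_SHARED_SPCOMPUTE
#define UA_FA4_SHARED_SPCOMPUTE 0
#endif
#ifndef UA_FA4_UNION_KV
#define UA_FA4_UNION_KV 0
#endif

// Hypothetical holder that turns the macros into typed constants.
struct PrefillD128Levers
{
    static constexpr int kKvTile = UA_PREFILL_D128_BLOCKSIZE;

    static constexpr bool kSharedSpCompute = (UA_FA4_SHARED_SPCOMPUTE != 0);

    // The shared sp_compute layout forces in-place delta, so the effective
    // flag is the OR of the two levers.
    static constexpr bool kInplaceDelta =
        (UA_FA4_INPLACE_DELTA != 0) || kSharedSpCompute;

    static constexpr bool kUnionKV = (UA_FA4_UNION_KV != 0);
};

static_assert(PrefillD128Levers::kKvTile > 0, "KV tile must be positive");
static_assert(!PrefillD128Levers::kSharedSpCompute ||
                  PrefillD128Levers::kInplaceDelta,
              "shared sp_compute implies in-place delta");
```

With no -D flags, the constants resolve to the defaults above. A probe build
would pass, for example, -DUA_PREFILL_D128_BLOCKSIZE=128
-DUA_FA4_SHARED_SPCOMPUTE=1.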

The P thread-buffer size is now exposed as a type-derived constexpr
(kPThreadBufSize) so that the static_assert/static_for sites still work when
sp(idx) is the runtime proxy.
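
A sketch of the type-derived-size idea, under stated assumptions: ThreadBuffer,
SlotProxy, the buffer length of 32 and sum_slot are hypothetical stand-ins, and
only the name kPThreadBufSize comes from this change. The point is that the
size is read from the type that sp(idx) returns rather than from the
runtime-indexed object, so it remains usable in constant expressions:

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical per-thread buffer whose length is part of its type.
template <typename T, std::size_t N>
struct ThreadBuffer
{
    static constexpr std::size_t size() { return N; }
    T data[N];
};

// Hypothetical runtime proxy: sp(idx) picks a slot with a runtime index.
template <typename Buf>
struct SlotProxy
{
    Buf slots[2];
    Buf& operator()(int idx) { return slots[idx]; }
};

using PBuf    = ThreadBuffer<unsigned char, 32>; // 32 is an arbitrary example
using SpProxy = SlotProxy<PBuf>;

// Derive the size from the return type of sp(idx), not from its value.
using PBufType =
    std::remove_reference_t<decltype(std::declval<SpProxy&>()(0))>;
constexpr std::size_t kPThreadBufSize = PBufType::size();

static_assert(kPThreadBufSize % 8 == 0, "example compile-time check");

// Unrolled loop in the spirit of static_for, driven by kPThreadBufSize.
template <std::size_t... Is>
inline int sum_slot_impl(const PBuf& b, std::index_sequence<Is...>)
{
    int total = 0;
    int unused[] = {(total += b.data[Is], 0)...};
    (void)unused;
    return total;
}

inline int sum_slot(SpProxy& sp, int idx)
{
    return sum_slot_impl(sp(idx),
                         std::make_index_sequence<kPThreadBufSize>{});
}
```

The slot index stays a runtime value throughout; only the loop bound and the
assertion depend on the type.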

Co-authored-by: Cursor <cursoragent@cursor.com>