mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-29 11:16:59 +00:00
Pipeline cleanup (-fav4):
* Delete the 8-wave compute/memory ping-pong baseline (the ~200-line
monolithic `core_loop` lambda + its 2-warp-group dispatch). It was
reachable only under -DUA_FA4_PIPELINE=0 and never beat FA4 on any
measured prefill shape, so it was dead under the default build.
* Drop the UA_FA4_PIPELINE toggle entirely. kFA4 is now derived purely
from NumWarpGroups==2 + the 32x32x16 within-wave FP8 P-relayout
invariant, with a static_assert pinning that every 2-WG instance is
FA4-capable (fails the build loudly instead of running an empty loop).
* Remove the now-orphaned ADD_SBARRIER_FOR_PHASE0/PHASE2 knobs (they
only gated barriers inside the deleted core_loop). MOVE_FMHA_MASK_*
stay (still consumed by the FA4 core-loop scheduler).
* The non-FA4 pre-stage + fmha_post_process epilogue are retained: they
are shared by the single-warp-group (NumWarpGroups==1) serial decode
path, where kFA4 is false.
Behaviour-preserving for the default build: FA4 prefill perf is bit-for-
bit unchanged (b16 sq=sk=10000 fp8 CK=5.76ms before/after) and the full
decode regression (d{64,128} x {bf16,fp8} x split-KV {2,64}) still PASSes.
Add opt-in prefill fallback knob (unified_attention.cpp):
* AITER_UA_PREFILL_FALLBACK=1 routes prefill-sized shapes to the 4-warp
single-warp-group *serial* decode_*_m128 instances instead of FA4.
Reuses already-compiled instances (no extra binary). OFF by default:
the serial path has no matrix/softmax overlap and measured ~0.66-0.70x
Triton vs FA4's ~0.73-0.80x on gfx950 fp8 GQA-12/2 (i.e. SLOWER than
FA4). Kept as a diagnostic / robustness A-B knob only.
Co-authored-by: Cursor <cursoragent@cursor.com>