Files
composable_kernel/example
juuso-oskari 5d1def74a6 CK-UA: remove legacy ping-pong pipeline; FA4 is the only 2-WG path
Pipeline cleanup (-fav4):
  * Delete the 8-wave compute/memory ping-pong baseline (the ~200-line
    monolithic `core_loop` lambda + its 2-warp-group dispatch). It was
    reachable only under -DUA_FA4_PIPELINE=0 and never beat FA4 on any
    measured prefill shape, so it was dead under the default build.
  * Drop the UA_FA4_PIPELINE toggle entirely. kFA4 is now derived purely
    from NumWarpGroups==2 + the 32x32x16 within-wave FP8 P-relayout
    invariant, with a static_assert pinning that every 2-WG instance is
    FA4-capable (fails the build loudly instead of running an empty loop).
  * Remove the now-orphaned ADD_SBARRIER_FOR_PHASE0/PHASE2 knobs (they
    only gated barriers inside the deleted core_loop). MOVE_FMHA_MASK_*
    stay (still consumed by the FA4 core-loop scheduler).
  * The non-FA4 pre-stage + fmha_post_process epilogue are retained: they
    are shared by the single-warp-group (NumWarpGroups==1) serial decode
    path, where kFA4 is false.

Behaviour-preserving for the default build: FA4 prefill perf is bit-for-
bit unchanged (b16 sq=sk=10000 fp8 CK=5.76ms before/after) and the full
decode regression (d{64,128} x {bf16,fp8} x split-KV {2,64}) still PASSes.

Add opt-in prefill fallback knob (unified_attention.cpp):
  * AITER_UA_PREFILL_FALLBACK=1 routes prefill-sized shapes to the 4-warp
    single-warp-group *serial* decode_*_m128 instances instead of FA4.
    Reuses already-compiled instances (no extra binary). OFF by default:
    the serial path has no matrix/softmax overlap and measured ~0.66-0.70x
    Triton vs FA4's ~0.73-0.80x on gfx950 fp8 GQA-12/2 (i.e. SLOWER than
    FA4). Kept as a diagnostic / robustness A-B knob only.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-03 09:15:30 +00:00
..
2026-01-14 07:31:45 -08:00