Files
composable_kernel/include
juuso-oskari 7a319d9a4b CK-UA: drop redundant phase-0 s_barrier (-3% fp8 prefill_d128 decode)
`ADD_SBARRIER_FOR_PHASE0=1` added an extra `s_barrier()` at the start of
every `cl_p` half of every KV iteration, on top of the three barriers
that already gate the LDS hand-offs in phases 1/2/3.

rocprofv3 bottleneck analysis (b=4 sq=8 sk=4096 hq=64 hk=8 d=128 fp8):
the prefill_d128 8-warp variant spends ~15% of GUI_ACTIVE cycles at
s_barriers and shows %any_wait ≈ 200%. PC sampling pinpoints the
phase-0 `s_barrier` (right after softmax rescale, before async prefetch)
as a top hotspot.

Examining the data flow shows the phase-0 barrier is redundant:
  - phase1's `s_waitcnt vmcnt(...); s_barrier` guards the K-LDS write
    (from the previous iter's K async load) before any warp reads it.
  - phase2's `s_waitcnt lgkmcnt(0); s_barrier` guards the softmax-P
    LDS write before gemm1 reads it.
  - phase3's `s_waitcnt vmcnt(...); s_barrier` guards the V-LDS write
    before the next iter's V-LDS read.

These three already provide every cross-warp ordering the pipeline
needs. The phase-0 barrier was purely defensive.

Measurement: 0.1945 → 0.1883 ms (n=300 iters × 3 trials, single shape).
Correctness verified against the Triton reference on fp8/bf16/fp16 ×
{b=4/32/128} × {sq=1/4/8} × {causal,non-causal} × d∈{64,128}.

Leaving the macro and the `=1` documented path in place so the previous
behaviour can be restored if a future arch/shape regresses.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-18 19:16:48 +00:00
..