mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-29 11:16:59 +00:00
`ADD_SBARRIER_FOR_PHASE0=1` added an extra `s_barrier()` at the start of
every `cl_p` half of every KV iteration, on top of the three barriers
that already gate the LDS hand-offs in phases 1/2/3.
rocprofv3 bottleneck analysis (b=4 sq=8 sk=4096 hq=64 hk=8 d=128 fp8):
the prefill_d128 8-warp variant spends ~15% of GUI_ACTIVE cycles at
s_barriers and shows %any_wait ≈ 200%. PC sampling pinpoints the
phase-0 `s_barrier` (right after softmax rescale, before async prefetch)
as a top hotspot.
Examining the data flow shows the phase-0 barrier is redundant:
- phase1's `s_waitcnt vmcnt(...); s_barrier` guards the K-LDS write
(from the previous iter's K async load) before any warp reads it.
- phase2's `s_waitcnt lgkmcnt(0); s_barrier` guards the softmax-P
LDS write before gemm1 reads it.
- phase3's `s_waitcnt vmcnt(...); s_barrier` guards the V-LDS write
before the next iter's V-LDS read.
These three already provide every cross-warp ordering the pipeline
needs. The phase-0 barrier was purely defensive.
Measurement: 0.1945 → 0.1883 ms (n=300 iters × 3 trials, single shape).
Correctness verified against the Triton reference on fp8/bf16/fp16 ×
{b=4/32/128} × {sq=1/4/8} × {causal,non-causal} × d∈{64,128}.
Leaving the macro and the `=1` documented path in place so the previous
behaviour can be restored if a future arch/shape regresses.
Co-authored-by: Cursor <cursoragent@cursor.com>