Mirror of https://github.com/ROCm/composable_kernel.git (synced 2026-06-29 11:16:59 +00:00)
- Add an architecture README for the unified_attention kernel folder:
  file map, per-CTA work assignment, online-softmax math + scale fusion,
  FA4 vs serial-decode regimes, paged-KV tiers, split-KV, and a
  tuning-knobs/failed-experiments table. Intended as the reference for
  the FlyDSL port.
- pipeline.hpp: condense the ~230-line experiment-macro header into
  terse, README-backed one-liners (all 13 macro definitions preserved
  bit-for-bit).
- kernel.hpp: merge the duplicated/contradictory "Step D" SWA-clip
  comment.
- Add gated multi-stage decode async-ring scaffolding (UA_DECODE_STAGES;
  the default of 2 is bit-identical; deeper depths measured
  perf-neutral, since decode is BW-bound).

Full matrix 263/263 PASS, 0 fail; comment-only kernel edits are
behavior-neutral (target fp8 decode shape unchanged at ~88us).

Co-authored-by: Cursor <cursoragent@cursor.com>