composable_kernel/include/ck_tile/ops
juuso-oskari 382bb198eb CK-UA: freeze docs + comment cleanup (+ gated decode-ring scaffolding)
- Add an architecture README for the unified_attention kernel folder: file
  map, per-CTA work assignment, online-softmax math + scale fusion, FA4 vs
  serial-decode regimes, paged-KV tiers, split-KV, and a tuning-knobs/failed-
  experiments table. Intended as the reference for the FlyDSL port.
- pipeline.hpp: condense the ~230-line experiment-macro header into terse,
  README-backed one-liners (all 13 macro definitions preserved bit-for-bit).
- kernel.hpp: merge the duplicated/contradictory "Step D" SWA-clip comment.
- Add gated multi-stage decode async-ring scaffolding (UA_DECODE_STAGES;
  default 2 is bit-identical, deeper rings measured perf-neutral because
  decode is bandwidth-bound).

Full matrix 263/263 PASS, 0 fail; comment-only kernel edits are
behavior-neutral (target fp8 decode shape unchanged at ~88us).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-17 09:14:44 +00:00