Mirror of https://github.com/ROCm/composable_kernel.git (synced 2026-06-29 11:16:59 +00:00)
Working state before the pipeline cleanup/refactor:
* FA4 matrix-softmax warp-group overlap pipeline (UA_FA4_PIPELINE=1).
* Widen the per-CTA query/output base offsets to long_index_t so that a
  large total_q (big-batch prefill) cannot overflow int32 and fault on
  the output store. cache_ptr_int32_overflow_possible only covers K/V,
  so it does not catch this case.
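The offset widening above can be illustrated with a minimal host-side sketch. It is not the kernel code: query_base_offset, fits_in_int32, and the row/stride values are hypothetical, and index_t / long_index_t are assumed here to be 32-bit and 64-bit signed integers respectively.

```cpp
#include <cstdint>
#include <limits>

// Assumption: these aliases stand in for the library's index types.
using index_t      = std::int32_t;
using long_index_t = std::int64_t;

// Hypothetical helper: element offset of the first query/output row owned by
// a CTA. Both operands are widened BEFORE the multiply, so the product is
// computed in 64 bits and cannot wrap for any pair of 32-bit inputs.
constexpr long_index_t query_base_offset(index_t first_row, index_t row_stride)
{
    return static_cast<long_index_t>(first_row) *
           static_cast<long_index_t>(row_stride);
}

// True when the offset would still have been representable as int32,
// i.e. when the old 32-bit arithmetic would have been safe.
constexpr bool fits_in_int32(long_index_t v)
{
    return v >= std::numeric_limits<index_t>::min() &&
           v <= std::numeric_limits<index_t>::max();
}
```

With illustrative values of about one million query rows and a row stride of 4096 elements, the base offset is 2^32, which no longer fits in int32. A much smaller prefill (1024 rows) stays in range, which would be consistent with the fault only appearing at large total_q.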
Co-authored-by: Cursor <cursoragent@cursor.com>