mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-14 10:09:41 +00:00
1. Dual-tile: add both bn0=64 (preferred) and bn0=32 (fallback) for hdim=64 on gfx9 and gfx12. The dispatch checks page_block_size % bn0 == 0 at runtime to select the optimal tile. bn0=64 halves KV iterations when page_block_size >= 64. 2. Tile dict now supports lists per hdim. The codegen loop iterates over all tile variants, generating separate kernel instances for each. Combine kernels are unaffected (tile-independent). 3. Enable kMergeNumHeadGroupsSeqLenQ for hdim=64 decode (previously hdim=128 only). For GQA-8 with max_seqlen_q=1, this packs 8 head groups into the M dimension. Only activates for no-mask instances (kernel static_assert requires !kHasMask). 4. Add qr (non-async) pipeline for fwd non-bias group mode as fallback after qr_async. The async pipeline on this branch has a kernel-level bug where fmha_fwd launches but writes no output. Made-with: Cursor