composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 17:19:12 +00:00

Files

root cb6fb2802d Split-KV codegen: dual-tile dispatch and head-merge for hdim=64

1. Dual-tile: add both bn0=64 (preferred) and bn0=32 (fallback) for
   hdim=64 on gfx9 and gfx12. The dispatch checks page_block_size %
   bn0 == 0 at runtime to select the tile. Relative to bn0=32, bn0=64
   halves KV iterations when page_block_size is a multiple of 64.

2. Tile dict now supports lists per hdim. The codegen loop iterates
   over all tile variants, generating separate kernel instances for
   each. Combine kernels are unaffected (tile-independent).

3. Enable kMergeNumHeadGroupsSeqLenQ for hdim=64 decode (previously
   hdim=128 only). For GQA-8 with max_seqlen_q=1, this packs 8 head
   groups into the M dimension. Only activates for no-mask instances
   (kernel static_assert requires !kHasMask).

4. Add qr (non-async) pipeline for fwd non-bias group mode as
   fallback after qr_async. The async pipeline on this branch has a
   kernel-level bug where fmha_fwd launches but writes no output.
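
The tile selection in items 1 and 2 can be sketched roughly as below. This is an illustration only, not the repository's codegen: `TILES_BY_HDIM`, `select_bn0`, and `kv_iterations` are hypothetical names, and the only behavior taken from the commit message is the per-hdim list of tiles and the `page_block_size % bn0 == 0` check.

```python
# Hypothetical tile table: hdim -> list of bn0 candidates, preferred first.
# Only the hdim=64 entry (64 preferred, 32 fallback) comes from the commit message.
TILES_BY_HDIM = {64: [64, 32]}


def select_bn0(hdim, page_block_size):
    """Return the first bn0 that evenly divides page_block_size, or None."""
    for bn0 in TILES_BY_HDIM.get(hdim, []):
        if page_block_size % bn0 == 0:
            return bn0
    return None  # no tile variant fits this page size


def kv_iterations(seqlen_k, bn0):
    """Number of KV tiles needed to cover seqlen_k (ceiling division)."""
    return -(-seqlen_k // bn0)


if __name__ == "__main__":
    for page in (128, 64, 32, 16):
        print(page, select_bn0(64, page))
    print(kv_iterations(1024, 64), kv_iterations(1024, 32))
```

Under these assumptions, a page size of 64 or 128 picks bn0=64, a page size of 32 falls back to bn0=32, and a page size that neither tile divides gets no match.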

Made-with: Cursor

2026-04-01 16:24:25 +00:00

ops

Split-KV codegen: dual-tile dispatch and head-merge for hdim=64

2026-04-01 16:24:25 +00:00

__init__.py

chore(copyright): update copyright header for example directory (#3273)

2025-11-24 18:02:41 -08:00

arch.py

chore(copyright): update copyright header for example directory (#3273)

2025-11-24 18:02:41 -08:00

cmake_config.py

chore(copyright): update copyright header for example directory (#3273)