mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-30 11:47:48 +00:00
Introduce UnifiedAttentionPipelineDecodePolicy with NumWarpPerGroup=2, enabling sequence<2,1,1> (2 warps, 1D layout along M). This gives kBlockM=64 and kBlockQ=8 for GQA-8, reducing Q tile padding waste from 15/16 (kBlockM=128) to 7/8 for decode workloads.

Key insight: instead of fighting with 2D warp layouts (which break the permlane32_swap softmax reduction), use fewer warps with a smaller NumWarpPerGroup. The 1D warp layout is preserved, so no reduction changes are needed.

Benchmark (64-seq decode, d64 GQA-8):
  kBlockM=128 (prev): 0.03406ms
  kBlockM=64  (this): 0.03247ms (~4.7% faster)
  Total vs baseline:  0.06177ms -> 0.03247ms (1.90x speedup)

Made-with: Cursor
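
The tile arithmetic behind the 15/16 and 7/8 figures can be sketched as below. This is only an illustration, not code from the repository: `DecodeTile`, `make_tile`, `padded_slots`, and `kRowsPerWarp` are hypothetical names, and the 32 rows per warp is an assumption inferred from "2 warps gives kBlockM=64". It also assumes decode supplies one live query token per sequence, so every other token slot in the Q tile is padding.

```cpp
// Hypothetical sketch of the tile-size arithmetic; none of these names
// exist in composable_kernel.
struct DecodeTile {
    int block_m;  // rows in the Q tile (kBlockM)
    int block_q;  // query-token slots in the tile (kBlockQ)
};

// Assumption: each warp covers 32 rows along M (inferred from 2 warps -> 64).
constexpr int kRowsPerWarp = 32;

// With a 1D warp layout along M, kBlockM scales with the warp count, and
// each query token occupies one row per Q head in the GQA group.
constexpr DecodeTile make_tile(int num_warps_m, int gqa_group_size) {
    return DecodeTile{num_warps_m * kRowsPerWarp,
                      num_warps_m * kRowsPerWarp / gqa_group_size};
}

// Token slots in the tile that hold padding rather than live queries.
constexpr int padded_slots(DecodeTile tile, int live_tokens) {
    return tile.block_q - live_tokens;
}

// GQA-8 decode, one live token per sequence.
static_assert(make_tile(4, 8).block_q == 16, "kBlockM=128 -> kBlockQ=16");
static_assert(padded_slots(make_tile(4, 8), 1) == 15, "15 of 16 slots padded");
static_assert(make_tile(2, 8).block_q == 8, "kBlockM=64 -> kBlockQ=8");
static_assert(padded_slots(make_tile(2, 8), 1) == 7, "7 of 8 slots padded");
```

The checks run at compile time, so the snippet is verified just by compiling it. Halving the warp count halves kBlockQ, and the absolute number of padded slots per tile drops from 15 to 7.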