Introduce UnifiedAttentionPipelineDecodePolicy with NumWarpPerGroup=2,
enabling sequence<2,1,1> (2 warps in a 1D layout along M). This gives
kBlockM=64 and kBlockQ=8 for GQA-8, reducing Q-tile padding waste for
decode workloads from 15/16 (kBlockM=128) to 7/8.
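
The padding figures can be re-derived with a short sketch. It assumes
kBlockQ = kBlockM / GQA ratio and one real query token per sequence during
decode; both are inferred from the numbers in this message, not taken from
the kernel source. The helper name `q_tile_padding` is hypothetical.

```python
from fractions import Fraction


def q_tile_padding(block_m: int, gqa_ratio: int, q_len: int = 1) -> Fraction:
    """Fraction of Q positions in one tile that are padding.

    Assumption (not from the kernel source): the M dimension packs all
    `gqa_ratio` query heads of a KV-head group, so the tile covers
    block_m // gqa_ratio query positions, of which only `q_len` are real.
    """
    block_q = block_m // gqa_ratio
    return Fraction(block_q - q_len, block_q)


print(q_tile_padding(128, 8))  # 15/16 (previous kBlockM=128)
print(q_tile_padding(64, 8))   # 7/8   (this change, kBlockM=64)
```

Under these assumptions the padded fraction stays high even at kBlockM=64,
which fits the modest gain reported in the benchmark.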

Key insight: rather than switching to a 2D warp layout (which breaks the
permlane32_swap softmax reduction), use fewer warps via a smaller
NumWarpPerGroup. The warp layout stays 1D, so the reduction needs no
changes.

Benchmark (64-seq decode, d64, GQA-8):
  kBlockM=128 (prev): 0.03406 ms
  kBlockM=64  (this): 0.03247 ms (~4.7% lower latency)
  Total vs baseline:  0.06177 ms -> 0.03247 ms (1.90x speedup)

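
The derived figures follow directly from the raw timings. A minimal sketch
of the arithmetic (the helper names are hypothetical):

```python
def latency_reduction_pct(before_ms: float, after_ms: float) -> float:
    """Percentage by which latency dropped relative to `before_ms`."""
    return 100.0 * (before_ms - after_ms) / before_ms


def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of old latency to new latency."""
    return before_ms / after_ms


# kBlockM=128 -> kBlockM=64
print(f"{latency_reduction_pct(0.03406, 0.03247):.1f}%")  # 4.7%
# baseline -> this change
print(f"{speedup(0.06177, 0.03247):.2f}x")  # 1.90x
```
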
Made-with: Cursor