composable_kernel/example
Amir Ghamarian ae1d09f545 Add 2-warp decode kernel with kBlockM=64 for minimal tile waste
Introduce UnifiedAttentionPipelineDecodePolicy with NumWarpPerGroup=2,
enabling sequence<2,1,1> (2 warps, 1D layout along M). For GQA-8 this
gives kBlockM=64 and kBlockQ=8, so a decode query that fills one Q row
per tile wastes 7/8 of the Q tile on padding instead of 15/16
(kBlockM=128, kBlockQ=16).
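
The padding fractions above can be checked with a small compile-time
sketch. It is not code from this change: block_q, padding_waste and
Fraction are hypothetical helpers, and it assumes kBlockQ = kBlockM /
(query heads per KV head) and one live query token per sequence during
decode, which is what the 15/16 and 7/8 figures imply.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper (not a composable_kernel API): number of query
// tokens covered by one M tile when each KV head serves `gqa_ratio`
// query heads. Assumes block_m is a multiple of gqa_ratio.
constexpr int block_q(int block_m, int gqa_ratio) { return block_m / gqa_ratio; }

// Padding waste expressed as an exact fraction num/den.
struct Fraction {
    int num;
    int den;
};

// Fraction of Q tile rows that are padding when only `live_tokens`
// query tokens are real (assumed to be 1 for decode).
constexpr Fraction padding_waste(int block_m, int gqa_ratio, int live_tokens = 1) {
    return Fraction{block_q(block_m, gqa_ratio) - live_tokens,
                    block_q(block_m, gqa_ratio)};
}

// The two configurations quoted in the commit message.
static_assert(block_q(128, 8) == 16, "kBlockM=128, GQA-8 -> kBlockQ=16");
static_assert(block_q(64, 8) == 8, "kBlockM=64, GQA-8 -> kBlockQ=8");
static_assert(padding_waste(128, 8).num == 15 && padding_waste(128, 8).den == 16,
              "15/16 of the Q tile is padding at kBlockM=128");
static_assert(padding_waste(64, 8).num == 7 && padding_waste(64, 8).den == 8,
              "7/8 of the Q tile is padding at kBlockM=64");

// Print the waste for one configuration, e.g. print_waste(64, 8).
inline void print_waste(int block_m, int gqa_ratio) {
    const Fraction w = padding_waste(block_m, gqa_ratio);
    std::printf("kBlockM=%d kBlockQ=%d waste=%d/%d\n",
                block_m, block_q(block_m, gqa_ratio), w.num, w.den);
}
```

Under the same assumptions, a smaller M tile only helps while
kBlockQ stays above the number of live query tokens per sequence.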

Key insight: instead of fighting with 2D warp layouts (which break the
permlane32_swap softmax reduction), use fewer warps with a smaller
NumWarpPerGroup. The 1D warp layout is preserved so no reduction changes
are needed.

Benchmark (64-seq decode, d64 GQA-8):
  kBlockM=128 (prev): 0.03406ms
  kBlockM=64  (this): 0.03247ms (~4.7% lower latency)
  Total vs baseline:  0.06177ms -> 0.03247ms (1.90x speedup)
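
For reference, the percentages follow from the raw timings as sketched
below. The helper names speedup and time_reduction_pct are hypothetical
and used only for this illustration; the inputs are the three timings
listed in the benchmark.

```cpp
#include <cassert>

// Hypothetical helpers for reading the benchmark numbers.

// Ratio of old to new latency (values above 1.0 mean the new kernel is faster).
constexpr double speedup(double before_ms, double after_ms) {
    return before_ms / after_ms;
}

// Latency reduction relative to the old timing, in percent.
constexpr double time_reduction_pct(double before_ms, double after_ms) {
    return 100.0 * (before_ms - after_ms) / before_ms;
}

// kBlockM=128 -> kBlockM=64: roughly 4.7% less time per launch.
static_assert(time_reduction_pct(0.03406, 0.03247) > 4.6 &&
              time_reduction_pct(0.03406, 0.03247) < 4.7,
              "latency reduction rounds to ~4.7%");

// Baseline -> this change: roughly 1.90x.
static_assert(speedup(0.06177, 0.03247) > 1.90 &&
              speedup(0.06177, 0.03247) < 1.91,
              "total speedup rounds to 1.90x");
```

The ~4.7% figure is a reduction in latency; expressed as a throughput
ratio the same pair of timings gives about 1.05x.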

Made-with: Cursor
2026-03-28 10:57:10 +00:00