composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 11:47:48 +00:00

Files

Amir Ghamarian ae1d09f545 Add 2-warp decode kernel with kBlockM=64 for minimal tile waste

Introduce UnifiedAttentionPipelineDecodePolicy with NumWarpPerGroup=2,
enabling sequence<2,1,1> (2 warps in a 1D layout along M). This gives
kBlockM=64 and kBlockQ=8 for GQA-8, reducing the padded fraction of the
Q tile from 15/16 (kBlockM=128) to 7/8 for decode workloads.
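
The waste figures above can be reproduced with a small sketch. It assumes
that kBlockQ = kBlockM / (GQA ratio) and that a decode step contributes one
query token per sequence. Both assumptions are inferred from the numbers in
this message, not taken from the kernel source. The helper names block_q and
wasted_rows are hypothetical and do not exist in ck_tile.

```cpp
// Sketch only: block_q and wasted_rows are illustrative names, not ck_tile APIs.

// Query rows per tile, assuming kBlockQ = kBlockM / gqa_ratio.
constexpr int block_q(int k_block_m, int gqa_ratio)
{
    return k_block_m / gqa_ratio;
}

// Padded (unused) query rows in one tile, assuming `tokens` real query
// tokens per sequence (1 for a decode step).
constexpr int wasted_rows(int k_block_q, int tokens = 1)
{
    return k_block_q - tokens;
}

// kBlockM=128, GQA-8: kBlockQ=16, so 15 of 16 rows are padding.
static_assert(block_q(128, 8) == 16, "kBlockQ for kBlockM=128");
static_assert(wasted_rows(block_q(128, 8)) == 15, "15/16 wasted");

// kBlockM=64, GQA-8: kBlockQ=8, so 7 of 8 rows are padding.
static_assert(block_q(64, 8) == 8, "kBlockQ for kBlockM=64");
static_assert(wasted_rows(block_q(64, 8)) == 7, "7/8 wasted");
```

Under these assumptions the number of padded rows per tile drops from 15 to
7, which is why the smaller tile suits decode even though most of it is
still padding.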

Key insight: instead of fighting with 2D warp layouts (which break the
permlane32_swap softmax reduction), use fewer warps with a smaller
NumWarpPerGroup. The 1D warp layout is preserved so no reduction changes
are needed.

Benchmark (64-seq decode, d64 GQA-8):
  kBlockM=128 (prev): 0.03406ms
  kBlockM=64  (this): 0.03247ms (~4.7% faster)
  Total vs baseline:  0.06177ms -> 0.03247ms (1.90x speedup)

Made-with: Cursor

2026-03-28 10:57:10 +00:00

Implement batched gemm bias permute for RDNA4 (#3534)

2026-01-17 08:30:27 +01:00

ck_tile

Add 2-warp decode kernel with kBlockM=64 for minimal tile waste

2026-03-28 10:57:10 +00:00

rapidjson

Update pre-commit to fixed versions, run remod for ck_tile (#2895)

2025-10-16 15:29:17 -07:00