composable_kernel/example
Amir Ghamarian ea157f6244 Route all prefill to 4-warp kBlockM=128 kernel
An exhaustive sweep over 363 production trace shapes shows that the
4-warp serial pipeline outperforms the 8-warp interleaved pipeline on
all 71 prefill shapes in the sweep, with no exceptions.

The 4-warp kernel has better CU occupancy and the serial pipeline's
async prefetch is sufficient for these workloads.

Dispatch is now: tiny (decode) -> small (short decode) -> medium (all prefill).
The 8-warp large tier is no longer used for d64 GQA-8.
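
The tier selection described above can be sketched roughly as follows.
This is an illustration only, not the code from this commit: the enum,
the function name select_tier, and both sequence-length cut-offs are
hypothetical, since the commit message does not state the real
thresholds or identifiers. It only shows that no shape reaches the
large tier any more.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tier names mirroring the commit message; these are not
// identifiers taken from composable_kernel.
enum class Tier { Tiny, Small, Medium, Large };

// Hypothetical cut-offs chosen for illustration. The real dispatch
// conditions are not given in the commit message.
constexpr std::int32_t kTinyMaxSeqlenQ  = 1;   // assumed: single-token decode
constexpr std::int32_t kSmallMaxSeqlenQ = 16;  // assumed: short decode

// Pick a kernel tier from the query sequence length.
// Tier::Large (the 8-warp kernel) is never returned: every shape above
// the small cut-off is treated as prefill and goes to the 4-warp
// kBlockM=128 kernel, represented here by Tier::Medium.
inline Tier select_tier(std::int32_t seqlen_q) {
    assert(seqlen_q >= 1 && "query length must be positive");
    if (seqlen_q <= kTinyMaxSeqlenQ) {
        return Tier::Tiny;
    }
    if (seqlen_q <= kSmallMaxSeqlenQ) {
        return Tier::Small;
    }
    return Tier::Medium;
}
```

With these assumed cut-offs, select_tier(1) yields Tier::Tiny,
select_tier(8) yields Tier::Small, and any longer query, for example
select_tier(8192), yields Tier::Medium.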

Made-with: Cursor
2026-03-28 13:52:42 +00:00