composable_kernel/example
juuso-oskari fddb0d21cd Add d=128 MHA decode variant (decode_d128_mha_m128)
Until now every d=128 MHA workload took the 8-warp prefill kernel
(kBlockM=256, kBlockQ=256), wasting 255/256 Q rows on pure-decode
shapes where Q is 1. Add a dedicated 4-warp decode variant with
kBlockM=128 (kBlockQ=128) that cuts the Q-tile waste roughly in half.

  * Four new instance files at instances/unified_attention_d128_*_decode.cpp,
    each instantiating unified_attention_decode_kernel_traits<dt, mask, 128, 128, 1>.
  * KernelVariant::decode_d128_mha_m128 wired into select_config: chosen
    when both avg_q and max_seqlen_q fit within 128; otherwise selection
    falls back to the prefill variant.

Tests: ua-test-scripts/test_unified_attention_ck_correctness.py stays at
236/240 -- the pure-decode seq_lens pattern in head_config=(16,16,128)
now routes to the new variant and matches the torch reference. The 4
remaining failures are the pre-existing int32-overflow case, orthogonal
to this change.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 09:34:52 +00:00