Until now every d=128 MHA workload took the 8-warp prefill kernel
(kBlockM=256, kBlockQ=256), wasting 255 of 256 Q-tile rows on
pure-decode shapes where seqlen_q is 1. Add a dedicated 4-warp decode
variant with kBlockM=128 (kBlockQ=128) that roughly halves the wasted
Q rows per tile (255 -> 127).
* Four new instance files at instances/unified_attention_d128_*_decode.cpp,
  each instantiating unified_attention_decode_kernel_traits<dt, mask, 128, 128, 1>
  (sketched after this list).
* KernelVariant::decode_d128_mha_m128 wired into select_config: chosen
  when both avg_q and max_seqlen_q fit within 128; otherwise dispatch
  falls back to the prefill variant (see the routing sketch below).
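
For illustration, a minimal self-contained sketch of what one instance
file boils down to. Only the traits name and the <dt, mask, 128, 128, 1>
parameter list are from this change; the stub tag types and the trait's
body are stand-ins, not the real headers:

    struct bf16_tag {};  // stand-in for the real dtype tag (assumption)
    struct NoMask  {};   // stand-in for the real mask tag (assumption)

    // Stand-in for the real traits template; only the parameter list
    // <dt, mask, kBlockM=128, kBlockQ=128, 1> is taken from this change.
    template <class Dt, class Mask, int kBlockM, int kBlockQ, int kLast>
    struct unified_attention_decode_kernel_traits {
        static constexpr int block_m = kBlockM;  // 128-row Q tile, 4 warps
        static constexpr int block_q = kBlockQ;
    };

    // What each of the four instance files boils down to:
    template struct unified_attention_decode_kernel_traits<
        bf16_tag, NoMask, 128, 128, 1>;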
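
And a hedged sketch of the select_config routing. The variant name and
the 128-row condition are from this change; everything else (the prefill
entry's name, the shape struct, its field spellings) is assumed:

    enum class KernelVariant {
        prefill_d128_mha_m256,  // existing 8-warp prefill path (name assumed)
        decode_d128_mha_m128,   // new 4-warp decode path
    };

    struct AttnShape {     // assumed shape summary handed to select_config
        int avg_q;         // average query length across the batch
        int max_seqlen_q;  // maximum query length across the batch
    };

    KernelVariant select_config(const AttnShape& s) {
        // Take the decode tile only when every sequence's Q fits in one
        // 128-row block; otherwise keep the 256-row prefill kernel.
        if (s.avg_q <= 128 && s.max_seqlen_q <= 128)
            return KernelVariant::decode_d128_mha_m128;
        return KernelVariant::prefill_d128_mha_m256;
    }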
Tests: ua-test-scripts/test_unified_attention_ck_correctness.py stays at
236/240 -- the pure-decode seq_lens pattern in head_config=(16,16,128)
now routes to the new variant and matches the torch reference. The four
remaining failures come from the pre-existing int32-overflow case,
orthogonal to this change.
Co-authored-by: Cursor <cursoragent@cursor.com>