composable_kernel/include
juuso-oskari 0f009a3442 CK-UA: fix split-KV partition for non-dividing GQA + add ps128 decode instances
Under split-KV, a KV token co-owned by two query tiles (which happens only
when num_queries_per_kv does not divide kBlockM, e.g. d=128 qpkv=6) was
assigned its split partition from the per-tile causal horizon
(total_num_kv_blocks, which grows with the query tile index). The two owning
tiles then reduced disjoint KV-block ranges for that shared token, and the
combine step merged partials computed over different ranges, producing a
~1-row error or NaN on the tile-boundary token. MHA and GQA ratios that divide
kBlockM are immune (no token is shared across tiles).

Fix: derive blocks_per_split from the causal-INDEPENDENT full-sequence block
count so split s maps to the same blocks in every query tile, then clamp only
the END by the per-tile causal horizon. The duplicate co-owned store becomes
idempotent again. num_splits == 1 is unchanged.
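
A minimal sketch of the partition change described above, assuming a ceil-div
split of KV blocks. The helper names (split_range_per_tile, split_range_fixed,
BlockRange) and the block counts are hypothetical and are not taken from the CK
source; only the idea (partition from the full-sequence block count, clamp only
the end by the per-tile causal horizon) comes from this commit.

```cpp
#include <algorithm>

// Half-open KV-block range [begin, end) that one split reduces in one query tile.
// Hypothetical type for illustration only.
struct BlockRange { int begin; int end; };

constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Old behaviour (sketch): blocks_per_split comes from the per-tile causal
// horizon, so the same split index covers different blocks in different tiles.
constexpr BlockRange split_range_per_tile(int split, int num_splits,
                                          int tile_causal_blocks)
{
    const int blocks_per_split = ceil_div(tile_causal_blocks, num_splits);
    const int begin = std::min(split * blocks_per_split, tile_causal_blocks);
    const int end   = std::min(begin + blocks_per_split, tile_causal_blocks);
    return {begin, end};
}

// New behaviour (sketch): blocks_per_split comes from the full-sequence block
// count, which does not depend on the query tile. Only the end is clamped by
// the per-tile causal horizon; a split that starts past it is empty.
constexpr BlockRange split_range_fixed(int split, int num_splits,
                                       int full_seq_blocks,
                                       int tile_causal_blocks)
{
    const int blocks_per_split = ceil_div(full_seq_blocks, num_splits);
    const int begin = split * blocks_per_split;
    const int end   = std::min(begin + blocks_per_split, tile_causal_blocks);
    return {begin, std::max(begin, end)};
}

// Two neighbouring query tiles with causal horizons of 6 and 8 KV blocks,
// a full sequence of 8 blocks, and num_splits = 2 (made-up numbers).
// Old: split 1 starts at block 3 in one tile and at block 4 in the other.
static_assert(split_range_per_tile(1, 2, 6).begin == 3, "tile A, old");
static_assert(split_range_per_tile(1, 2, 8).begin == 4, "tile B, old");
// New: split 1 starts at block 4 in both tiles; only the end differs.
static_assert(split_range_fixed(1, 2, 8, 6).begin == 4, "tile A, new");
static_assert(split_range_fixed(1, 2, 8, 8).begin == 4, "tile B, new");
// num_splits == 1 still covers [0, causal horizon) in both versions.
static_assert(split_range_fixed(0, 1, 8, 6).end ==
              split_range_per_tile(0, 1, 6).end, "single split unchanged");
```

With the tile-independent partition, both tiles that own a shared token place
the split boundary at the same block, so their partials for a given split cover
the same blocks up to that token's own causal limit.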

Also adds the d128 bf16 page_size=128 decode instances (mask/nmask x
default/s/t) plus the matching dispatch in unified_attention.cpp and the
fmha_batch_prefill codegen hook.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-08 08:46:55 +00:00