composable_kernel/include
juuso-oskari 310efc556f CK-UA: halve kBlockN for bf16/fp16 m16 decode + generalise PVAttrNumAccess
The decode_d128_m16 tier was VGPR-saturated and LDS-bound on bf16/fp16
(probe_decode_d128 showed VGPR=256 + AGPR overflow, ~2x fp8's LDS at
the same kBlockN), capping it at 1 CTA/CU. Halving kBlockN for the
non-fp8 path on the m16 tier sheds enough LDS and VGPR pressure to
fit 3-4 CTAs/CU (LDS-bound). The halved kBlockN forces a smaller-K
MFMA shape on the m16 PV gemm (16x16x32 -> 16x16x16); we also
auto-adjust WarpGemm::K so PVAttrNumAccess picks Single vs Double
access correctly. The PVAttrNumAccess derivation is now generic,
driven by (kABKPerLane, SubMinDim) rather than just (dtype), so the
new shape compiles without per-variant special-casing.
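
The occupancy argument above can be sketched as a resource-limit
calculation: CTAs/CU is the minimum of what each resource allows.
This is only an illustration. ctas_per_cu is a hypothetical helper,
and the LDS budget, per-CTA byte counts and register-limited caps
below are made-up values, not figures from probe_decode_d128.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper, not part of composable_kernel.
// lds_budget:       LDS bytes available per CU (assumed value supplied by caller)
// lds_per_cta:      LDS bytes one CTA allocates
// reg_limited_ctas: how many CTAs the VGPR/AGPR budget alone would allow
constexpr std::uint32_t ctas_per_cu(std::uint32_t lds_budget,
                                    std::uint32_t lds_per_cta,
                                    std::uint32_t reg_limited_ctas)
{
    // A CTA that uses no LDS is limited by registers only.
    if(lds_per_cta == 0)
        return reg_limited_ctas;
    // Occupancy is capped by the scarcer of the two resources.
    return std::min(lds_budget / lds_per_cta, reg_limited_ctas);
}

// Illustrative numbers only: a 64 KiB LDS budget is assumed.
constexpr std::uint32_t kLdsBudget = 64 * 1024;

// Large tile: register-saturated, so 1 CTA/CU regardless of LDS.
static_assert(ctas_per_cu(kLdsBudget, 40 * 1024, 1) == 1, "register-bound");
// Smaller tile: registers no longer limit, LDS decides (3-4 CTAs/CU).
static_assert(ctas_per_cu(kLdsBudget, 20 * 1024, 8) == 3, "LDS-bound");
static_assert(ctas_per_cu(kLdsBudget, 16 * 1024, 8) == 4, "LDS-bound");
```

With the smaller tile the register cap stops being the limiting term
and the quotient lds_budget / lds_per_cta decides, which is what
"LDS-bound" means in the message above.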

Only variants where cfg::BlockSize/2 >= WarpGemm::N are affected
(i.e. decode_d128_m16); m32/m128/prefill keep their un-halved tiles
since they use 32x32 N-warps.
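
A minimal sketch of the guard described above, written as a
compile-time selection. Cfg, WarpGemm, the dtype tags and
EffectiveBlockN are hypothetical stand-ins, not the real CK types.
The sketch also assumes that the un-halved kBlockN equals
cfg::BlockSize and that BlockSize is 32, neither of which is stated
in this message.

```cpp
#include <type_traits>

// Hypothetical dtype tags; the real kernel uses its own element types.
struct fp8_tag {};
struct bf16_tag {};
struct fp16_tag {};

// Hypothetical stand-ins: only the members the guard reads are modelled.
template <int BlockSizeV>
struct Cfg
{
    static constexpr int BlockSize = BlockSizeV;
};

template <int NV>
struct WarpGemm
{
    static constexpr int N = NV;
};

// Assumption: the un-halved kBlockN equals cfg::BlockSize.
template <typename DType, typename CfgT, typename WarpGemmT>
struct EffectiveBlockN
{
    // fp8 keeps the full tile.
    static constexpr bool kIsFp8 = std::is_same_v<DType, fp8_tag>;
    // The halved tile must still cover one warp gemm along N.
    static constexpr bool kFitsHalved = (CfgT::BlockSize / 2 >= WarpGemmT::N);
    static constexpr int value =
        (!kIsFp8 && kFitsHalved) ? CfgT::BlockSize / 2 : CfgT::BlockSize;
};

// m16-style variant (16-wide N warp gemm): bf16/fp16 are halved, fp8 is not.
static_assert(EffectiveBlockN<bf16_tag, Cfg<32>, WarpGemm<16>>::value == 16, "");
static_assert(EffectiveBlockN<fp16_tag, Cfg<32>, WarpGemm<16>>::value == 16, "");
static_assert(EffectiveBlockN<fp8_tag, Cfg<32>, WarpGemm<16>>::value == 32, "");
// 32-wide N warp gemm (m32/m128/prefill style): the guard fails, tile unchanged.
static_assert(EffectiveBlockN<bf16_tag, Cfg<32>, WarpGemm<32>>::value == 32, "");
```

Keeping the condition in terms of tile and warp-gemm sizes, rather
than variant names, lets variants with 32-wide N warps opt out
without a per-variant list.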

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-26 08:20:55 +00:00