Files
composable_kernel/include
juuso-oskari f947db93fe CK-UA: fix wide-MMA FP8 P relayout (was cvt-only, missing QK-C->PV-A transpose)
The wide 32x32x64 FP8 path shipped a "cvt-only, layouts coincide" P relayout
(strategy C) that did no cross-lane movement, on the claim that the QK-C output
and PV-A input per-thread layouts already match at K=64. They don't: QK-C holds
one kv across many query rows while PV-A needs one query across many kv (a
transpose), so the relayout MUST do the cross-lane permlane32_swap.

Both the 32x32x16 and 32x32x64 MMAs share an identical 32x32 C-output
distribution; only kABKPerLane changes (8->32), i.e. the per-lane chunk COUNT,
not the per-chunk swap pattern. So strategy A's existing fused cvt+permlane32
relayout is correct for K=64 too -- the 8-fp8 loop just runs more iterations.

The bug was masked by near-uniform softmax (transposing a flat P barely moves
the row-sum), surfacing only as a few large-delta output lanes -> ~0.6-11% of
elements over the loose fp8 tol on prefill_fp8, while bf16 and fp8 decode passed.

Fix: gate strategy A for K=64 in addition to K=16; delete the cvt-only branch.
prefill_fp8 + the full correctness matrix now PASS (standalone host-ref mismatch
0.96% -> 0.0000%); standalone perf holds ~1.66k TFLOPs (the permlane32 ops are
cheap and overlap under softmax).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-12 08:18:21 +00:00
..