mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-29 11:16:59 +00:00
The wide 32x32x64 FP8 path shipped a "cvt-only, layouts coincide" P relayout (strategy C) that did no cross-lane movement, on the claim that the QK-C output and PV-A input per-thread layouts already match at K=64. They don't: QK-C holds one kv across many query rows while PV-A needs one query across many kv (a transpose), so the relayout MUST do the cross-lane permlane32_swap. Both the 32x32x16 and 32x32x64 MMAs share an identical 32x32 C-output distribution; only kABKPerLane changes (8->32), i.e. the per-lane chunk COUNT, not the per-chunk swap pattern. So strategy A's existing fused cvt+permlane32 relayout is correct for K=64 too -- the 8-fp8 loop just runs more iterations. The bug was masked by near-uniform softmax (transposing a flat P barely moves the row-sum), surfacing only as a few large-delta output lanes -> ~0.6-11% of elements over the loose fp8 tol on prefill_fp8, while bf16 and fp8 decode passed. Fix: gate strategy A for K=64 in addition to K=16; delete the cvt-only branch. prefill_fp8 + the full correctness matrix now PASS (standalone host-ref mismatch 0.96% -> 0.0000%); standalone perf holds ~1.66k TFLOPs (the permlane32 ops are cheap and overlap under softmax). Co-authored-by: Cursor <cursoragent@cursor.com>