- Add strategy C (cvt-only, barrier-free) QK-C->PV-A FP8 relayout for the
K=64 wide v_mfma_f32_32x32x64 tile: QK-C and PV-A per-thread layouts
coincide under the wide MMA, so the relayout is just the fp32->fp8 pack
(matches the ASM kernel's _softmax_pack_P_fp8). Gate kFP8RelayoutWithinWave
for K=64 in addition to K=16; both are FA4-safe (no in-softmax barrier).
- Wire up the wide-MMA variant config (example) and the default relayout policy.
- Move the FA4 V LDS transpose-read out of the preceding SOFTMAX into the
MATRIX phase, off the longer/critical softmax path (UA_FA4_VLOAD_IN_MATRIX=1).
- Add UA_FA4_PIN_PACK_IN_SOFTMAX experiment toggle (default 0).
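
For reference, a minimal host-side sketch of what a cvt-only relayout reduces
to per thread: convert four fp32 softmax probabilities to FP8 and pack them
into one 32-bit register. This is an illustration under stated assumptions,
not the kernel code: it assumes the OCP E4M3 encoding (bias 7, max 448, no
inf) with round-to-nearest-even and saturation, and the names f32_to_e4m3 and
pack4_e4m3 are hypothetical. The kernel itself would use the hardware convert
instructions rather than this scalar path.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical reference encoder: fp32 -> FP8 E4M3 (assumed OCP variant).
// Relies on the default rounding mode (round-to-nearest-even) for nearbyint.
inline uint8_t f32_to_e4m3(float x) {
    const uint8_t sign = std::signbit(x) ? 0x80 : 0x00;
    if (std::isnan(x)) return sign | 0x7F;        // the only NaN pattern
    const float a = std::fabs(x);
    if (a == 0.0f) return sign;                   // signed zero
    if (a >= 448.0f) return sign | 0x7E;          // saturate to max finite
    int e = 0;
    std::frexp(a, &e);                            // a lies in [2^(e-1), 2^e)
    int exp = e - 1;                              // floor(log2(a))
    if (exp < -6) exp = -6;                       // subnormals share one scale
    // Count of 2^(exp-3) quanta: 8..16 for normals, 0..8 for subnormals.
    int q = static_cast<int>(std::nearbyint(std::ldexp(a, 3 - exp)));
    if (q == 16) { exp += 1; q = 8; }             // rounding carried upward
    if (q < 8) return sign | static_cast<uint8_t>(q);   // subnormal result
    return sign | static_cast<uint8_t>(((exp + 7) << 3) | (q - 8));
}

// Hypothetical pack: four fp32 values -> one dword, element 0 in the low byte.
inline uint32_t pack4_e4m3(float v0, float v1, float v2, float v3) {
    return  static_cast<uint32_t>(f32_to_e4m3(v0))
         | (static_cast<uint32_t>(f32_to_e4m3(v1)) << 8)
         | (static_cast<uint32_t>(f32_to_e4m3(v2)) << 16)
         | (static_cast<uint32_t>(f32_to_e4m3(v3)) << 24);
}
```

Because no value crosses lanes in this sketch, nothing in it needs an LDS
round trip or a barrier, which is the property the cvt-only strategy relies on.
The byte order within the dword is an assumption and may differ from what the
PV-A operand actually expects.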

Measured: wide MMA narrowed the structural gap vs. the ASM fp8 kernel from
~1.75x to ~1.16x at b1/h5/sq75600/d128 (1711 TF standalone).

Co-authored-by: Cursor <cursoragent@cursor.com>