mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-29 19:28:33 +00:00
kv_tile held k_tile/v_tile in a *union* to save VGPRs, but on gfx950 the union forced a hard serialization: K_lds_load wrote the same registers the PV MFMA was reading (v_tile), so the K ds_read could not start until the PV MFMA retired -> full LDS latency exposed at the QK gemm's s_waitcnt_lgkmcnt<0> (ATT: ~half of all memwait stall). Make k_tile/v_tile separate registers and pin K_lds_load between the PV and QK MFMAs with sched_barrier so its ds_read executes on the LSU *during* the PV MFMA (latency hidden) without being hoisted above PV (which would race the partner WG's cooperative K load on long contexts). Occupancy is VGPR-bound here (160KB LDS), and .vgpr_count is unchanged (172 -> 172), so the change is free. Standalone fp8 d128 sq8192: ~515.7 -> ~497.5 us (-3.7%), memwait 31% -> 19%, accuracy 0% mismatch. Co-authored-by: Cursor <cursoragent@cursor.com>