Files
composable_kernel/include
juuso-oskari 26bc49f733 CK-UA: un-union kv_tile so K ds_read overlaps PV MFMA (FA4 fp8)
kv_tile held k_tile/v_tile in a *union* to save VGPRs, but on gfx950 the
union forced a hard serialization: K_lds_load wrote the same registers
the PV MFMA was reading (v_tile), so the K ds_read could not start until
the PV MFMA retired -> full LDS latency exposed at the QK gemm's
s_waitcnt_lgkmcnt<0> (ATT: ~half of all memwait stall).

Make k_tile/v_tile separate registers and pin K_lds_load between the PV
and QK MFMAs with sched_barrier so its ds_read executes on the LSU
*during* the PV MFMA (latency hidden) without being hoisted above PV
(which would race the partner WG's cooperative K load on long contexts).

Occupancy is VGPR-bound here (160KB LDS), and .vgpr_count is unchanged
(172 -> 172), so the change is free. Standalone fp8 d128 sq8192:
~515.7 -> ~497.5 us (-3.7%), memwait 31% -> 19%, accuracy 0% mismatch.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-09 10:46:56 +00:00
..