Issue the next iteration's K/V global loads (K_mem_load, V_mem_load)
immediately after the barrier, before the PV GEMM and the K LDS read,
so that the async global->LDS copies overlap with the current
iteration's GEMM compute.

Also remove the redundant barriers between the PV and QK phases. K and
V use separate LDS regions, so there are no read/write conflicts
between the two phases.

Benchmark (64-seq decode, d64 GQA-8), kernel time:
  Phase 1 -> Phase 2:  0.03564 ms -> 0.03406 ms
                       (~1.046x, i.e. ~4.4% lower latency)
  Baseline -> Phase 2: 0.06177 ms -> 0.03406 ms (1.81x speedup)
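
For reference, the ratios can be re-derived from the three timings reported above. This is only a sanity-check sketch: the helper names (speedup, latency_reduction_pct) are made up for illustration and are not part of the kernel or its benchmark harness.

```python
# Latencies in milliseconds, copied from the benchmark numbers in this message.
baseline_ms = 0.06177
phase1_ms = 0.03564
phase2_ms = 0.03406


def speedup(before_ms, after_ms):
    """Ratio of old to new latency (1.0 means no change)."""
    return before_ms / after_ms


def latency_reduction_pct(before_ms, after_ms):
    """Percentage of the old latency that was removed."""
    return 100.0 * (before_ms - after_ms) / before_ms


print(f"phase 1 -> phase 2: {speedup(phase1_ms, phase2_ms):.3f}x, "
      f"{latency_reduction_pct(phase1_ms, phase2_ms):.1f}% lower latency")
print(f"baseline -> phase 2: {speedup(baseline_ms, phase2_ms):.2f}x")
```

The "~4.6% faster" figure corresponds to the speedup ratio (old/new minus one); measured as a reduction in latency, the same change is about 4.4%.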

Made-with: Cursor