mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-30 03:37:38 +00:00
Move next iteration's K/V global loads (K_mem_load, V_mem_load) to immediately after the barrier, before PV GEMM and K LDS read. This overlaps the async global->LDS copies with the current iteration's GEMM compute.

Also remove redundant barriers between PV and QK phases since K/V use separate LDS regions (no read/write conflicts).

Benchmark improvement (64-seq decode, d64 GQA-8):
  Phase 1: 0.03564ms -> Phase 2: 0.03406ms (~4.6% faster)
  Total vs original baseline: 0.06177ms -> 0.03406ms (1.81x speedup)

Made-with: Cursor
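
Illustrative sketch of the reordering described above, not the kernel code from this change. It is a host-side C++ model under stated assumptions: `load_tile`, `compute_tile`, `pipelined_sum` and `kTile` are hypothetical names, the two staging buffers stand in for the separate LDS regions, and the copy here is synchronous, whereas on the GPU it would still be in flight while the GEMM runs. The point is only the loop order: the next iteration's load is issued before the current iteration's compute, and it targets a different buffer, so no barrier is needed between the two.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical tile size; the real kernel's tile shapes are not modeled here.
constexpr std::size_t kTile = 4;
using Tile = std::array<float, kTile>;

// Stand-in for a global->LDS copy (hypothetical helper).
// Synchronous in this model; asynchronous on the GPU.
void load_tile(const std::vector<float>& src, std::size_t tile_idx, Tile& dst)
{
    for (std::size_t i = 0; i < kTile; ++i)
        dst[i] = src[tile_idx * kTile + i];
}

// Stand-in for the per-iteration GEMM work: reduce one tile to a scalar.
float compute_tile(const Tile& t)
{
    return std::accumulate(t.begin(), t.end(), 0.0f);
}

float pipelined_sum(const std::vector<float>& src)
{
    assert(src.size() % kTile == 0);
    const std::size_t num_tiles = src.size() / kTile;
    if (num_tiles == 0)
        return 0.0f;

    // Two separate staging buffers: the load for tile i+1 never writes
    // the buffer that the compute for tile i is reading.
    std::array<Tile, 2> buf{};

    // Prologue: the first tile has nothing to overlap with.
    load_tile(src, 0, buf[0]);

    float acc = 0.0f;
    for (std::size_t i = 0; i < num_tiles; ++i)
    {
        // Issue the next iteration's load first ...
        if (i + 1 < num_tiles)
            load_tile(src, i + 1, buf[(i + 1) % 2]);

        // ... then compute on the current tile. On the GPU the copy
        // above would overlap with this work.
        acc += compute_tile(buf[i % 2]);
    }
    return acc;
}
```

For example, calling `pipelined_sum` on the eight values 1 through 8 processes two tiles and returns their total, 36.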