composable_kernel/include/ck_tile/ops
Amir Ghamarian 8d396d29f0 Add async prefetch overlap to single-warp-group pipeline
Move next iteration's K/V global loads (K_mem_load, V_mem_load) to
immediately after the barrier, before PV GEMM and K LDS read. This
overlaps the async global->LDS copies with the current iteration's
GEMM compute. Also remove redundant barriers between PV and QK phases
since K/V use separate LDS regions (no read/write conflicts).
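
The reordering described above can be pictured with a small host-side model. This is a sketch under assumptions, not the kernel code: K_mem_load and V_mem_load are the names used in this message, while "barrier", "pv_gemm" and "k_lds_read" are hypothetical labels, and skipping the prefetch on the last iteration is also an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Host-side model only: records the order in which one loop iteration issues
// its steps after the reordering. K_mem_load and V_mem_load are named in the
// commit message; "barrier", "pv_gemm" and "k_lds_read" are hypothetical labels.
std::vector<std::string> iteration_schedule(int i, int num_iters)
{
    std::vector<std::string> trace;
    trace.push_back("barrier");

    // Assumption: the prefetch for iteration i + 1 is skipped on the last
    // iteration, since there is no next tile to load.
    if(i + 1 < num_iters)
    {
        const std::string next = std::to_string(i + 1);
        trace.push_back("K_mem_load(" + next + ")"); // async global->LDS copy
        trace.push_back("V_mem_load(" + next + ")"); // async global->LDS copy
    }

    // Compute for the current iteration is issued after the loads, so the
    // copies can be in flight while the GEMM runs.
    trace.push_back("pv_gemm");
    trace.push_back("k_lds_read");
    return trace;
}
```

In this model the two loads for the next iteration always sit between the barrier and the PV GEMM, which is the overlap the change aims for.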

Benchmark improvement (64-seq decode, d64 GQA-8):
  Phase 1: 0.03564ms -> Phase 2: 0.03406ms (~4.6% faster)
  Total vs original baseline: 0.06177ms -> 0.03406ms (1.81x speedup)
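
For reference, the percentages above can be recomputed from the reported timings. The helper names below are hypothetical and only illustrate the arithmetic; the "~4.6% faster" figure matches the ratio of the two times, while the reduction in elapsed time is slightly smaller.

```cpp
#include <cassert>

// Ratio of old time to new time (1.0 means no change).
double speedup(double before_ms, double after_ms)
{
    return before_ms / after_ms;
}

// Percentage of the old elapsed time that was removed.
double time_reduction_percent(double before_ms, double after_ms)
{
    return 100.0 * (before_ms - after_ms) / before_ms;
}
```

With the numbers above, speedup(0.03564, 0.03406) is about 1.046, speedup(0.06177, 0.03406) is about 1.81, and time_reduction_percent(0.03564, 0.03406) is about 4.4.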

Made-with: Cursor
2026-03-28 10:47:45 +00:00