composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 03:37:38 +00:00

Files

Amir Ghamarian 8d396d29f0 Add async prefetch overlap to single-warp-group pipeline

Move next iteration's K/V global loads (K_mem_load, V_mem_load) to
immediately after the barrier, before PV GEMM and K LDS read. This
overlaps the async global->LDS copies with the current iteration's
GEMM compute. Also remove redundant barriers between PV and QK phases
since K/V use separate LDS regions (no read/write conflicts).
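
The reordering described above can be sketched on the host as a double-buffered loop. This is only an illustration under stated assumptions, not the ck_tile pipeline: `Tile`, `mem_load`, `compute`, `pipelined_sum`, and `reference_sum` are hypothetical names, the two-slot staging buffers are an assumption, and the "async" copy is an ordinary blocking copy, so the sketch shows the issue order (next tile's loads before the current tile's compute) rather than real overlap.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model; none of these names come from ck_tile.
constexpr std::size_t kTile = 4;
using Tile = std::array<float, kTile>;

// Stand-in for the async global->LDS copy (here a plain blocking copy).
inline void mem_load(const std::vector<Tile>& global, std::size_t i, Tile& lds_slot)
{
    lds_slot = global[i];
}

// Stand-in for the GEMM work on one tile: a dot product of a K and a V tile.
inline float compute(const Tile& k, const Tile& v)
{
    float acc = 0.0f;
    for(std::size_t j = 0; j < kTile; ++j)
        acc += k[j] * v[j];
    return acc;
}

// Loads for tile i+1 are issued before the compute on tile i.
float pipelined_sum(const std::vector<Tile>& k_global, const std::vector<Tile>& v_global)
{
    assert(k_global.size() == v_global.size());
    const std::size_t n = k_global.size();
    if(n == 0)
        return 0.0f;

    // K and V get separate staging regions, so their copies never alias.
    Tile k_lds[2];
    Tile v_lds[2];

    // Prologue: fetch the first tile before entering the loop.
    mem_load(k_global, 0, k_lds[0]);
    mem_load(v_global, 0, v_lds[0]);

    float acc = 0.0f;
    for(std::size_t i = 0; i < n; ++i)
    {
        const std::size_t cur = i % 2;
        const std::size_t nxt = 1 - cur;

        // (On the GPU, the barrier would sit here.)
        // Issue the next iteration's loads first...
        if(i + 1 < n)
        {
            mem_load(k_global, i + 1, k_lds[nxt]);
            mem_load(v_global, i + 1, v_lds[nxt]);
        }
        // ...then do the current iteration's compute on the other slot.
        acc += compute(k_lds[cur], v_lds[cur]);
    }
    return acc;
}

// Straightforward version without staging, used to check the result.
float reference_sum(const std::vector<Tile>& k_global, const std::vector<Tile>& v_global)
{
    float acc = 0.0f;
    for(std::size_t i = 0; i < k_global.size(); ++i)
        acc += compute(k_global[i], v_global[i]);
    return acc;
}
```

Calling `pipelined_sum(k, v)` and `reference_sum(k, v)` on the same inputs should give the same value; only the order in which loads are issued relative to compute differs.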

Benchmark improvement (64-seq decode, d64 GQA-8):
  Phase 1: 0.03564ms -> Phase 2: 0.03406ms (~4.6% faster)
  Total vs original baseline: 0.06177ms -> 0.03406ms (1.81x speedup)
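
The percentages above read as ratios of runtimes (old time divided by new time). A minimal sketch of that arithmetic, with `speedup` and `percent_faster` as hypothetical helper names:

```cpp
// Hypothetical helpers for reading the timings quoted above.

// Speedup as a ratio of runtimes: values above 1.0 mean "after" is faster.
inline double speedup(double before_ms, double after_ms)
{
    return before_ms / after_ms;
}

// The same ratio expressed as "percent faster".
inline double percent_faster(double before_ms, double after_ms)
{
    return (speedup(before_ms, after_ms) - 1.0) * 100.0;
}
```

With the quoted timings, `percent_faster(0.03564, 0.03406)` comes out near 4.6 and `speedup(0.06177, 0.03406)` near 1.81, matching the figures in the commit message.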

Made-with: Cursor

2026-03-28 10:47:45 +00:00

Implement batched gemm bias permute for RDNA4 (#3534)

2026-01-17 08:30:27 +01:00

ck_tile

Add async prefetch overlap to single-warp-group pipeline

2026-03-28 10:47:45 +00:00

rapidjson

Update pre-commit to fixed versions, run remod for ck_tile (#2895)

2025-10-16 15:29:17 -07:00