mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-05 06:01:23 +00:00
CK][fmha] Add StreamLLM sink support to batch_prefill pipeline (#6479) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation The existing paged-KV attention pipelines (pagedkv, splitkv) support StreamLLM-style sink tokens — a fixed set of initial tokens kept in attention alongside the sliding window. The `batch_prefill` pipeline (chunked-prefill with VLLM-style block tables) previously hardcoded `kHasSink = false`, making it incompatible with sink-based attention patterns in LLM serving scenarios. This PR extends `batch_prefill` to support `kHasSink` and wires it into `fmha_fwd_runner` for validation against the existing CPU reference. ## Technical Details **Pipeline** (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`): - When `kHasSink`, the K/V loop splits into a sink phase [0, sink_seq_end) and a window phase [seqlen_k_start, seqlen_k_end), mirroring pagedkv. - K advance at the sink→window transition jumps `seqlen_k_start - sink_seq_end + kN0` to bridge the gap. - V scatter-gather offsets are re-initialized at the transition to fix a window mismatch bug: V was lagging kN0 behind K after the large jump, loading from the wrong sequence position. - Bias window, dropout seq_offset, and mask type (LogitsSinkMask) updated for sink-awareness. **Traits / codegen** (`tile_fmha_traits.hpp`, `fmha_fwd.hpp`, `fmha_batch_prefill.py`): - `TileFmhaBatchPrefillTraits` gains `kHasSink_` (was hardcoded `false`). - Codegen adds `F_sink` field; skips batch-mode kernels (group mode required). - CMake test filter broadened from 9 → 33 instances covering fp16/bf16 × mask/nmask × lse/nlse × sink/nsink. **Runner** (`fmha_fwd_runner.hpp`, `CMakeLists.txt`): - `fmha_batch_prefill()` dispatched from `run_fwd` when: group mode + paged KV + num_splits == 1. - K/V strides corrected for runner's [num_pages, nhead_k, page_block_size, hdim] layout. - `page_block_size % 128` check relaxed: batch_prefill supports ps=16. - CPU reference paged-KV reordering guards extended with `CK_TILE_FMHA_FWD_BATCH_PREFILL_API`. ## Test Plan Build with `-DFMHA_FWD_ENABLE_APIS="fwd;batch_prefill"`, run `tile_example_fmha_fwd` in group mode with page_block_size=16. Test matrix: - Mask: no-mask, causal, sliding window - Sink: nsink, sink=1..128 - dtype: fp16, bf16 - LSE output: on/off - seqlen ∈ {512,1024,2048,4096} × window ∈ {32,256,512,1024} - GQA, chunked prefill, large batch×seqlen - page_block_size: 16, 32 ## Test Result 171 test cases, all valid:y: - nmask + nsink: ✓ - causal + nsink: ✓ - causal + sink=8: ✓ - sliding window + sink=8 (d=128, d=256): ✓ - bf16, LSE output, GQA: ✓ ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.